Data Science Project

Title of dataset: “Kepler Exoplanet Search Results”
Source: Kaggle - https://www.kaggle.com/nasa/kepler-exoplanet-search-results
Authors: Matthew Bazzo, Soo Hyung Choe, Shiming Yan, Alex Zhang

Introduction

The dataset contains 9564 observations and 50 variables (12 categorical and 38 numerical), totalling over 478,000 data points. Some of the key variables are: the disposition (the status of each observed object: confirmed, false positive, or candidate), the koi score, four koi flags (four tests used to determine the validity of a planet), planetary data, and features of the star the planet revolves around. The data were collected by the Kepler telescope, which monitors a star and measures the changes in brightness as an object moves between the star and the telescope.
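As a quick check of the size quoted above, the number of data points is simply rows times columns. The sketch below hard-codes the two counts from the text; with the data loaded, `prod(dim(starData))` should give the same figure.

```r
# Counts taken from the description above
n_obs  <- 9564   # observations (rows)
n_vars <- 50     # variables (columns)

# Total number of cells in the table
n_cells <- n_obs * n_vars
print(n_cells)   # 478200
```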

We came up with a range of questions and applied supervised and unsupervised learning to the dataset in order to answer them.

“The Data Science Process is about observation, model building, analysis and conclusion.” We followed the data science process as shown below:

  1. Ask questions and identify the problem
  2. Data Collection
  3. Data Exploration
  4. Data Modeling
  5. Data Analysis
  6. Visualization and Presentation of Result

Step 1: Ask questions and identify the problem

After looking at the data, we thought of many different questions and problems we wanted to tackle. We separated them into three categories depending on the technique that could be applied to the problem: exploratory data analysis (EDA), supervised learning, and unsupervised learning.

Exploratory Data Analysis

These questions are answered by basic visualizations and/or descriptive statistics.

  1. Are binary stars more likely to host planets?
  2. What are the feature distributions of likely habitable planets?
  3. What does sky-projected distance represent?
  4. What do the stars of Earth-like planets look like, and how do they compare to our sun?
  5. Do the Earth-like planets congregate within certain patches of the night sky?

Supervised Learning

These questions invoke the use of supervised learning techniques to develop predictive models.

  1. Can we determine a classification system for exoplanet candidates (koi_disposition)?

Unsupervised Learning

These questions, or problems, invoke the use of unsupervised learning techniques to devise labels for observations.

  1. K-Means Clustering: Planet Categorization

Step 2: Some Quick Visualizations and Exploratory Analysis

Exploratory Data Analysis

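# stringsAsFactors = TRUE keeps text columns as factors (the default before R 4.0),
# which the levels() calls below rely on
starData <- read.csv("cumulative.csv", header = TRUE, na.strings = "", stringsAsFactors = TRUE)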
kepler_df <- starData
head(starData)
##   rowid    kepid kepoi_name  kepler_name koi_disposition koi_pdisposition
## 1     1 10797460  K00752.01 Kepler-227 b       CONFIRMED        CANDIDATE
## 2     2 10797460  K00752.02 Kepler-227 c       CONFIRMED        CANDIDATE
## 3     3 10811496  K00753.01         <NA>  FALSE POSITIVE   FALSE POSITIVE
## 4     4 10848459  K00754.01         <NA>  FALSE POSITIVE   FALSE POSITIVE
## 5     5 10854555  K00755.01 Kepler-664 b       CONFIRMED        CANDIDATE
## 6     6 10872983  K00756.01 Kepler-228 d       CONFIRMED        CANDIDATE
##   koi_score koi_fpflag_nt koi_fpflag_ss koi_fpflag_co koi_fpflag_ec
## 1     1.000             0             0             0             0
## 2     0.969             0             0             0             0
## 3     0.000             0             1             0             0
## 4     0.000             0             1             0             0
## 5     1.000             0             0             0             0
## 6     1.000             0             0             0             0
##   koi_period koi_period_err1 koi_period_err2 koi_time0bk koi_time0bk_err1
## 1   9.488036       2.775e-05      -2.775e-05    170.5387         0.002160
## 2  54.418383       2.479e-04      -2.479e-04    162.5138         0.003520
## 3  19.899140       1.494e-05      -1.494e-05    175.8503         0.000581
## 4   1.736952       2.630e-07      -2.630e-07    170.3076         0.000115
## 5   2.525592       3.761e-06      -3.761e-06    171.5956         0.001130
## 6  11.094321       2.036e-05      -2.036e-05    171.2012         0.001410
##   koi_time0bk_err2 koi_impact koi_impact_err1 koi_impact_err2 koi_duration
## 1        -0.002160      0.146           0.318          -0.146      2.95750
## 2        -0.003520      0.586           0.059          -0.443      4.50700
## 3        -0.000581      0.969           5.126          -0.077      1.78220
## 4        -0.000115      1.276           0.115          -0.092      2.40641
## 5        -0.001130      0.701           0.235          -0.478      1.65450
## 6        -0.001410      0.538           0.030          -0.428      4.59450
##   koi_duration_err1 koi_duration_err2 koi_depth koi_depth_err1
## 1           0.08190          -0.08190     615.8           19.5
## 2           0.11600          -0.11600     874.8           35.5
## 3           0.03410          -0.03410   10829.0          171.0
## 4           0.00537          -0.00537    8079.2           12.8
## 5           0.04200          -0.04200     603.3           16.9
## 6           0.06100          -0.06100    1517.5           24.2
##   koi_depth_err2 koi_prad koi_prad_err1 koi_prad_err2 koi_teq koi_teq_err1
## 1          -19.5     2.26          0.26         -0.15     793           NA
## 2          -35.5     2.83          0.32         -0.19     443           NA
## 3         -171.0    14.60          3.92         -1.31     638           NA
## 4          -12.8    33.46          8.50         -2.83    1395           NA
## 5          -16.9     2.75          0.88         -0.35    1406           NA
## 6          -24.2     3.90          1.27         -0.42     835           NA
##   koi_teq_err2 koi_insol koi_insol_err1 koi_insol_err2 koi_model_snr
## 1           NA     93.59          29.45         -16.65          35.8
## 2           NA      9.11           2.87          -1.62          25.8
## 3           NA     39.30          31.04         -10.49          76.3
## 4           NA    891.96         668.95        -230.35         505.6
## 5           NA    926.16         874.33        -314.24          40.9
## 6           NA    114.81         112.85         -36.70          66.5
##   koi_tce_plnt_num koi_tce_delivname koi_steff koi_steff_err1
## 1                1   q1_q17_dr25_tce      5455             81
## 2                2   q1_q17_dr25_tce      5455             81
## 3                1   q1_q17_dr25_tce      5853            158
## 4                1   q1_q17_dr25_tce      5805            157
## 5                1   q1_q17_dr25_tce      6031            169
## 6                1   q1_q17_dr25_tce      6046            189
##   koi_steff_err2 koi_slogg koi_slogg_err1 koi_slogg_err2 koi_srad
## 1            -81     4.467          0.064         -0.096    0.927
## 2            -81     4.467          0.064         -0.096    0.927
## 3           -176     4.544          0.044         -0.176    0.868
## 4           -174     4.564          0.053         -0.168    0.791
## 5           -211     4.438          0.070         -0.210    1.046
## 6           -232     4.486          0.054         -0.229    0.972
##   koi_srad_err1 koi_srad_err2       ra      dec koi_kepmag
## 1         0.105        -0.061 291.9342 48.14165     15.347
## 2         0.105        -0.061 291.9342 48.14165     15.347
## 3         0.233        -0.078 297.0048 48.13413     15.436
## 4         0.201        -0.067 285.5346 48.28521     15.597
## 5         0.334        -0.133 288.7549 48.22620     15.509
## 6         0.315        -0.105 296.2861 48.22467     15.714

Load in all the required libraries.

library(Amelia)
## Warning: package 'Amelia' was built under R version 3.4.4
## Loading required package: Rcpp
## ## 
## ## Amelia II: Multiple Imputation
## ## (Version 1.7.4, built: 2015-12-05)
## ## Copyright (C) 2005-2018 James Honaker, Gary King and Matthew Blackwell
## ## Refer to http://gking.harvard.edu/amelia/ for more information
## ##
library(ggplot2)
library(ggthemes)
## Warning: package 'ggthemes' was built under R version 3.4.4
library(gridExtra)
## Warning: package 'gridExtra' was built under R version 3.4.4
library(rpart)
library(rpart.plot)
library(corrplot)
## corrplot 0.84 loaded
library(plotly)
## Warning: package 'plotly' was built under R version 3.4.4
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
library(randomForest)
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:gridExtra':
## 
##     combine
## The following object is masked from 'package:ggplot2':
## 
##     margin
library(class)
library(e1071)
library(neuralnet)
## Warning: package 'neuralnet' was built under R version 3.4.4

First, we compared koi_disposition with koi_pdisposition to see where the two agree and where they differ.

levels(starData$koi_disposition)
## [1] "CANDIDATE"      "CONFIRMED"      "FALSE POSITIVE"
levels(starData$koi_pdisposition)
## [1] "CANDIDATE"      "FALSE POSITIVE"

koi_disposition has three categories and koi_pdisposition has two. The two dispositions share the categories “CANDIDATE” and “FALSE POSITIVE”; koi_disposition has an additional category named “CONFIRMED”.

Data1 <- subset(starData, koi_disposition =="CONFIRMED" & koi_pdisposition =="CANDIDATE")
Data2 <- subset(starData, koi_disposition =="CONFIRMED" & koi_pdisposition =="FALSE POSITIVE")
Data3 <- subset(starData, koi_disposition =="CANDIDATE" & koi_pdisposition =="CANDIDATE")
Data4 <- subset(starData, koi_disposition =="CANDIDATE" & koi_pdisposition =="FALSE POSITIVE")
Data5 <- subset(starData, koi_disposition =="FALSE POSITIVE" & koi_pdisposition =="CANDIDATE")
Data6 <- subset(starData, koi_disposition =="FALSE POSITIVE" & koi_pdisposition =="FALSE POSITIVE")

Out of 9564 rows:

When koi_disposition is Confirmed, koi_pdisposition is Candidate 2248 times and False Positive 45 times, a misclassification rate of about 2% (45 of 2293).

When koi_disposition is Candidate, koi_pdisposition is Candidate 2248 times and False Positive 0 times, a misclassification rate of 0%.

When koi_disposition is False Positive, koi_pdisposition is Candidate 0 times and False Positive 5023 times, a misclassification rate of 0%.

As seen in our analysis, the koi_disposition and koi_pdisposition are very similar. There are small discrepancies which are only present when koi_disposition is classified as Confirmed.
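The six subsets can also be summarized in one cross-tabulation. The sketch below is self-contained: it rebuilds the two label columns from the counts reported in this section rather than reading the CSV, so the `counts` and `labels` objects are illustrative stand-ins. With the real data, `table(starData$koi_disposition, starData$koi_pdisposition)` should produce the same table.

```r
# Counts reported above for each (koi_disposition, koi_pdisposition) pair
counts <- data.frame(
  koi_disposition  = c("CONFIRMED", "CONFIRMED", "CANDIDATE", "FALSE POSITIVE"),
  koi_pdisposition = c("CANDIDATE", "FALSE POSITIVE", "CANDIDATE", "FALSE POSITIVE"),
  n                = c(2248, 45, 2248, 5023)
)

# Expand to one row per observation (stand-in for the real columns)
labels <- counts[rep(seq_len(nrow(counts)), counts$n),
                 c("koi_disposition", "koi_pdisposition")]

# Rows: koi_disposition, columns: koi_pdisposition
xtab <- table(labels$koi_disposition, labels$koi_pdisposition)
print(xtab)

# Share of CONFIRMED rows that koi_pdisposition labels FALSE POSITIVE
disagree <- xtab["CONFIRMED", "FALSE POSITIVE"] / sum(xtab["CONFIRMED", ])
print(round(100 * disagree, 1))   # percent
```

A single table keeps all pairings visible at once, including the two empty cells, which is harder to see from six separate subsets.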

Next, we plot a histogram of koi_score to get a better understanding of its distribution:

hist(starData$koi_score)

As seen in the histogram above, the most frequent scores are located at 0 and 1. The other koi_scores make up a smaller share of the results and can be rounded to the nearest integer: either 0 or 1. Note that hist() silently drops the 1510 missing koi_score values reported by summary().
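A minimal sketch of the rounding idea, using the six koi_score values printed by head() plus one NA added purely for illustration. The 0.05 and 0.95 cut-offs are an assumption chosen for this example, not values from the original analysis.

```r
# First six koi_score values from head(), plus one NA for illustration
score <- c(1.000, 0.969, 0.000, 0.000, 1.000, 1.000, NA)

# Round to the nearest integer; NA stays NA
score_bin <- round(score)
print(table(score_bin, useNA = "ifany"))

# Share of non-missing scores that already sit close to 0 or 1
# (thresholds are illustrative assumptions)
extreme_share <- mean(score <= 0.05 | score >= 0.95, na.rm = TRUE)
print(extreme_share)
```

Running the same two lines on `starData$koi_score` would show how much of the full column is concentrated at the extremes before deciding whether rounding is acceptable.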

Next, we list the column names of the Kepler dataset:

titleLabels <- names(starData)
titleLabels
##  [1] "rowid"             "kepid"             "kepoi_name"       
##  [4] "kepler_name"       "koi_disposition"   "koi_pdisposition" 
##  [7] "koi_score"         "koi_fpflag_nt"     "koi_fpflag_ss"    
## [10] "koi_fpflag_co"     "koi_fpflag_ec"     "koi_period"       
## [13] "koi_period_err1"   "koi_period_err2"   "koi_time0bk"      
## [16] "koi_time0bk_err1"  "koi_time0bk_err2"  "koi_impact"       
## [19] "koi_impact_err1"   "koi_impact_err2"   "koi_duration"     
## [22] "koi_duration_err1" "koi_duration_err2" "koi_depth"        
## [25] "koi_depth_err1"    "koi_depth_err2"    "koi_prad"         
## [28] "koi_prad_err1"     "koi_prad_err2"     "koi_teq"          
## [31] "koi_teq_err1"      "koi_teq_err2"      "koi_insol"        
## [34] "koi_insol_err1"    "koi_insol_err2"    "koi_model_snr"    
## [37] "koi_tce_plnt_num"  "koi_tce_delivname" "koi_steff"        
## [40] "koi_steff_err1"    "koi_steff_err2"    "koi_slogg"        
## [43] "koi_slogg_err1"    "koi_slogg_err2"    "koi_srad"         
## [46] "koi_srad_err1"     "koi_srad_err2"     "ra"               
## [49] "dec"               "koi_kepmag"
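Many of the columns come in triples: a measurement plus its upper and lower error (`_err1`, `_err2`). One way to separate them is a pattern match on the names, sketched here on a hand-picked subset of the names listed above; with the real data, `names(starData)` would replace `cols`.

```r
# A few column names copied from the listing above
cols <- c("koi_period", "koi_period_err1", "koi_period_err2",
          "koi_teq", "koi_teq_err1", "koi_teq_err2", "ra", "dec")

# Columns whose names end in _err1 or _err2
err_cols <- grep("_err[12]$", cols, value = TRUE)

# Everything else: the measurements themselves
base_cols <- setdiff(cols, err_cols)

print(err_cols)
print(base_cols)
```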

Next, we inspect the blank/missing/empty data:

head(kepler_df)
summary(kepler_df)
##      rowid          kepid              kepoi_name         kepler_name  
##  Min.   :   1   Min.   :  757450   K00001.01:   1   Kepler-1 b  :   1  
##  1st Qu.:2392   1st Qu.: 5556034   K00002.01:   1   Kepler-10 b :   1  
##  Median :4782   Median : 7906892   K00003.01:   1   Kepler-10 c :   1  
##  Mean   :4782   Mean   : 7690628   K00004.01:   1   Kepler-100 b:   1  
##  3rd Qu.:7173   3rd Qu.: 9873066   K00005.01:   1   Kepler-100 c:   1  
##  Max.   :9564   Max.   :12935144   K00005.02:   1   (Other)     :2289  
##                                    (Other)  :9558   NA's        :7270  
##        koi_disposition       koi_pdisposition   koi_score     
##  CANDIDATE     :2248   CANDIDATE     :4496    Min.   :0.0000  
##  CONFIRMED     :2293   FALSE POSITIVE:5068    1st Qu.:0.0000  
##  FALSE POSITIVE:5023                          Median :0.3340  
##                                               Mean   :0.4808  
##                                               3rd Qu.:0.9980  
##                                               Max.   :1.0000  
##                                               NA's   :1510    
##  koi_fpflag_nt    koi_fpflag_ss    koi_fpflag_co    koi_fpflag_ec 
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.00  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.00  
##  Median :0.0000   Median :0.0000   Median :0.0000   Median :0.00  
##  Mean   :0.1882   Mean   :0.2316   Mean   :0.1949   Mean   :0.12  
##  3rd Qu.:0.0000   3rd Qu.:0.0000   3rd Qu.:0.0000   3rd Qu.:0.00  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.00  
##                                                                   
##    koi_period        koi_period_err1  koi_period_err2    koi_time0bk    
##  Min.   :     0.24   Min.   :0.0000   Min.   :-0.1725   Min.   : 120.5  
##  1st Qu.:     2.73   1st Qu.:0.0000   1st Qu.:-0.0003   1st Qu.: 132.8  
##  Median :     9.75   Median :0.0000   Median : 0.0000   Median : 137.2  
##  Mean   :    75.67   Mean   :0.0021   Mean   :-0.0021   Mean   : 166.2  
##  3rd Qu.:    40.72   3rd Qu.:0.0003   3rd Qu.: 0.0000   3rd Qu.: 170.7  
##  Max.   :129995.78   Max.   :0.1725   Max.   : 0.0000   Max.   :1472.5  
##                      NA's   :454      NA's   :454                       
##  koi_time0bk_err1 koi_time0bk_err2    koi_impact       koi_impact_err1 
##  Min.   :0.0000   Min.   :-0.5690   Min.   :  0.0000   Min.   : 0.000  
##  1st Qu.:0.0012   1st Qu.:-0.0105   1st Qu.:  0.1970   1st Qu.: 0.040  
##  Median :0.0041   Median :-0.0041   Median :  0.5370   Median : 0.193  
##  Mean   :0.0099   Mean   :-0.0099   Mean   :  0.7351   Mean   : 1.960  
##  3rd Qu.:0.0105   3rd Qu.:-0.0012   3rd Qu.:  0.8890   3rd Qu.: 0.378  
##  Max.   :0.5690   Max.   : 0.0000   Max.   :100.8060   Max.   :85.540  
##  NA's   :454      NA's   :454       NA's   :363        NA's   :454     
##  koi_impact_err2     koi_duration     koi_duration_err1 koi_duration_err2 
##  Min.   :-59.3200   Min.   :  0.052   Min.   : 0.0000   Min.   :-20.2000  
##  1st Qu.: -0.4450   1st Qu.:  2.438   1st Qu.: 0.0508   1st Qu.: -0.3500  
##  Median : -0.2070   Median :  3.793   Median : 0.1420   Median : -0.1420  
##  Mean   : -0.3326   Mean   :  5.622   Mean   : 0.3399   Mean   : -0.3399  
##  3rd Qu.: -0.0460   3rd Qu.:  6.277   3rd Qu.: 0.3500   3rd Qu.: -0.0508  
##  Max.   :  0.0000   Max.   :138.540   Max.   :20.2000   Max.   :  0.0000  
##  NA's   :454                          NA's   :454       NA's   :454       
##    koi_depth         koi_depth_err1     koi_depth_err2     
##  Min.   :      0.0   Min.   :     0.0   Min.   :-388600.0  
##  1st Qu.:    159.9   1st Qu.:     9.6   1st Qu.:    -49.5  
##  Median :    421.1   Median :    20.8   Median :    -20.8  
##  Mean   :  23791.3   Mean   :   123.2   Mean   :   -123.2  
##  3rd Qu.:   1473.4   3rd Qu.:    49.5   3rd Qu.:     -9.6  
##  Max.   :1541400.0   Max.   :388600.0   Max.   :      0.0  
##  NA's   :363         NA's   :454        NA's   :454        
##     koi_prad         koi_prad_err1      koi_prad_err2      
##  Min.   :     0.08   Min.   :    0.00   Min.   :-77180.00  
##  1st Qu.:     1.40   1st Qu.:    0.23   1st Qu.:    -1.94  
##  Median :     2.39   Median :    0.52   Median :    -0.30  
##  Mean   :   102.89   Mean   :   17.66   Mean   :   -33.02  
##  3rd Qu.:    14.93   3rd Qu.:    2.32   3rd Qu.:    -0.14  
##  Max.   :200346.00   Max.   :21640.00   Max.   :     0.00  
##  NA's   :363         NA's   :363        NA's   :363        
##     koi_teq      koi_teq_err1   koi_teq_err2     koi_insol       
##  Min.   :   25   Mode:logical   Mode:logical   Min.   :       0  
##  1st Qu.:  539   NA's:9564      NA's:9564      1st Qu.:      20  
##  Median :  878                                 Median :     142  
##  Mean   : 1085                                 Mean   :    7746  
##  3rd Qu.: 1379                                 3rd Qu.:     870  
##  Max.   :14667                                 Max.   :10947555  
##  NA's   :363                                   NA's   :321       
##  koi_insol_err1    koi_insol_err2     koi_model_snr    koi_tce_plnt_num
##  Min.   :      0   Min.   :-5600031   Min.   :   0.0   Min.   :1.000   
##  1st Qu.:      9   1st Qu.:    -287   1st Qu.:  12.0   1st Qu.:1.000   
##  Median :     73   Median :     -40   Median :  23.0   Median :1.000   
##  Mean   :   3751   Mean   :   -4044   Mean   : 259.9   Mean   :1.244   
##  3rd Qu.:    519   3rd Qu.:      -5   3rd Qu.:  78.0   3rd Qu.:1.000   
##  Max.   :3617133   Max.   :       0   Max.   :9054.7   Max.   :8.000   
##  NA's   :321       NA's   :321        NA's   :363      NA's   :346     
##        koi_tce_delivname   koi_steff     koi_steff_err1  koi_steff_err2   
##  q1_q16_tce     : 796    Min.   : 2661   Min.   :  0.0   Min.   :-1762.0  
##  q1_q17_dr24_tce: 368    1st Qu.: 5310   1st Qu.:106.0   1st Qu.: -198.0  
##  q1_q17_dr25_tce:8054    Median : 5767   Median :157.0   Median : -160.0  
##  NA's           : 346    Mean   : 5707   Mean   :144.6   Mean   : -162.3  
##                          3rd Qu.: 6112   3rd Qu.:174.0   3rd Qu.: -114.0  
##                          Max.   :15896   Max.   :676.0   Max.   :    0.0  
##                          NA's   :363     NA's   :468     NA's   :483      
##    koi_slogg     koi_slogg_err1   koi_slogg_err2       koi_srad      
##  Min.   :0.047   Min.   :0.0000   Min.   :-1.2070   Min.   :  0.109  
##  1st Qu.:4.218   1st Qu.:0.0420   1st Qu.:-0.1960   1st Qu.:  0.829  
##  Median :4.438   Median :0.0700   Median :-0.1280   Median :  1.000  
##  Mean   :4.310   Mean   :0.1207   Mean   :-0.1432   Mean   :  1.729  
##  3rd Qu.:4.543   3rd Qu.:0.1490   3rd Qu.:-0.0880   3rd Qu.:  1.345  
##  Max.   :5.364   Max.   :1.4720   Max.   : 0.0000   Max.   :229.908  
##  NA's   :363     NA's   :468      NA's   :468       NA's   :363      
##  koi_srad_err1     koi_srad_err2             ra             dec       
##  Min.   : 0.0000   Min.   :-116.1370   Min.   :279.9   Min.   :36.58  
##  1st Qu.: 0.1290   1st Qu.:  -0.2500   1st Qu.:288.7   1st Qu.:40.78  
##  Median : 0.2510   Median :  -0.1110   Median :292.3   Median :43.68  
##  Mean   : 0.3623   Mean   :  -0.3948   Mean   :292.1   Mean   :43.81  
##  3rd Qu.: 0.3640   3rd Qu.:  -0.0690   3rd Qu.:295.9   3rd Qu.:46.71  
##  Max.   :33.0910   Max.   :   0.0000   Max.   :301.7   Max.   :52.34  
##  NA's   :468       NA's   :468                                        
##    koi_kepmag    
##  Min.   : 6.966  
##  1st Qu.:13.440  
##  Median :14.520  
##  Mean   :14.265  
##  3rd Qu.:15.322  
##  Max.   :20.003  
##  NA's   :1
missmap(kepler_df)

Let’s get rid of the error measurements that are fully blank. Let’s also drop the identifier columns rowid and kepid, which carry no information about the planets themselves:

kepler_df$koi_teq_err1 <- NULL
kepler_df$koi_teq_err2 <- NULL
kepler_df$rowid <- NULL
kepler_df$kepid <- NULL

We should note that koi_teq_err1 and koi_teq_err2, which were deleted, quantify the error margin for the equilibrium temperature of the planets (koi_teq).
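Fully blank columns can also be found programmatically instead of being read off the summary. The sketch below uses a small toy data frame (`toy` is a hypothetical stand-in whose values loosely follow the head() output, with one NA added for illustration); the same three lines should work on `kepler_df`.

```r
# Toy stand-in for kepler_df: two columns are entirely NA
toy <- data.frame(
  koi_teq      = c(793, 443, 638),
  koi_teq_err1 = c(NA, NA, NA),
  koi_teq_err2 = c(NA, NA, NA),
  koi_score    = c(1.000, 0.969, NA)   # NA added for illustration
)

# Number of missing values per column
na_counts <- colSums(is.na(toy))
print(na_counts)

# Columns where every value is missing
all_na <- names(na_counts)[na_counts == nrow(toy)]

# Drop them, keeping a data frame even if one column remains
toy_clean <- toy[, !(names(toy) %in% all_na), drop = FALSE]
print(names(toy_clean))
```

Partially missing columns such as koi_score are kept, which matches the approach taken here of removing only the columns that contain no data at all.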

There are still a fair number of “NA” values remaining, but the data frame is workable from here.

missmap(kepler_df)

Let’s generate some summary stats:

summary(kepler_df)
##      kepoi_name         kepler_name         koi_disposition
##  K00001.01:   1   Kepler-1 b  :   1   CANDIDATE     :2248  
##  K00002.01:   1   Kepler-10 b :   1   CONFIRMED     :2293  
##  K00003.01:   1   Kepler-10 c :   1   FALSE POSITIVE:5023  
##  K00004.01:   1   Kepler-100 b:   1                        
##  K00005.01:   1   Kepler-100 c:   1                        
##  K00005.02:   1   (Other)     :2289                        
##  (Other)  :9558   NA's        :7270                        
##        koi_pdisposition   koi_score      koi_fpflag_nt    koi_fpflag_ss   
##  CANDIDATE     :4496    Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  FALSE POSITIVE:5068    1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000  
##                         Median :0.3340   Median :0.0000   Median :0.0000  
##                         Mean   :0.4808   Mean   :0.1882   Mean   :0.2316  
##                         3rd Qu.:0.9980   3rd Qu.:0.0000   3rd Qu.:0.0000  
##                         Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##                         NA's   :1510                                      
##  koi_fpflag_co    koi_fpflag_ec    koi_period        koi_period_err1 
##  Min.   :0.0000   Min.   :0.00   Min.   :     0.24   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.00   1st Qu.:     2.73   1st Qu.:0.0000  
##  Median :0.0000   Median :0.00   Median :     9.75   Median :0.0000  
##  Mean   :0.1949   Mean   :0.12   Mean   :    75.67   Mean   :0.0021  
##  3rd Qu.:0.0000   3rd Qu.:0.00   3rd Qu.:    40.72   3rd Qu.:0.0003  
##  Max.   :1.0000   Max.   :1.00   Max.   :129995.78   Max.   :0.1725  
##                                                      NA's   :454     
##  koi_period_err2    koi_time0bk     koi_time0bk_err1 koi_time0bk_err2 
##  Min.   :-0.1725   Min.   : 120.5   Min.   :0.0000   Min.   :-0.5690  
##  1st Qu.:-0.0003   1st Qu.: 132.8   1st Qu.:0.0012   1st Qu.:-0.0105  
##  Median : 0.0000   Median : 137.2   Median :0.0041   Median :-0.0041  
##  Mean   :-0.0021   Mean   : 166.2   Mean   :0.0099   Mean   :-0.0099  
##  3rd Qu.: 0.0000   3rd Qu.: 170.7   3rd Qu.:0.0105   3rd Qu.:-0.0012  
##  Max.   : 0.0000   Max.   :1472.5   Max.   :0.5690   Max.   : 0.0000  
##  NA's   :454                        NA's   :454      NA's   :454      
##    koi_impact       koi_impact_err1  koi_impact_err2     koi_duration    
##  Min.   :  0.0000   Min.   : 0.000   Min.   :-59.3200   Min.   :  0.052  
##  1st Qu.:  0.1970   1st Qu.: 0.040   1st Qu.: -0.4450   1st Qu.:  2.438  
##  Median :  0.5370   Median : 0.193   Median : -0.2070   Median :  3.793  
##  Mean   :  0.7351   Mean   : 1.960   Mean   : -0.3326   Mean   :  5.622  
##  3rd Qu.:  0.8890   3rd Qu.: 0.378   3rd Qu.: -0.0460   3rd Qu.:  6.277  
##  Max.   :100.8060   Max.   :85.540   Max.   :  0.0000   Max.   :138.540  
##  NA's   :363        NA's   :454      NA's   :454                         
##  koi_duration_err1 koi_duration_err2    koi_depth        
##  Min.   : 0.0000   Min.   :-20.2000   Min.   :      0.0  
##  1st Qu.: 0.0508   1st Qu.: -0.3500   1st Qu.:    159.9  
##  Median : 0.1420   Median : -0.1420   Median :    421.1  
##  Mean   : 0.3399   Mean   : -0.3399   Mean   :  23791.3  
##  3rd Qu.: 0.3500   3rd Qu.: -0.0508   3rd Qu.:   1473.4  
##  Max.   :20.2000   Max.   :  0.0000   Max.   :1541400.0  
##  NA's   :454       NA's   :454        NA's   :363        
##  koi_depth_err1     koi_depth_err2         koi_prad        
##  Min.   :     0.0   Min.   :-388600.0   Min.   :     0.08  
##  1st Qu.:     9.6   1st Qu.:    -49.5   1st Qu.:     1.40  
##  Median :    20.8   Median :    -20.8   Median :     2.39  
##  Mean   :   123.2   Mean   :   -123.2   Mean   :   102.89  
##  3rd Qu.:    49.5   3rd Qu.:     -9.6   3rd Qu.:    14.93  
##  Max.   :388600.0   Max.   :      0.0   Max.   :200346.00  
##  NA's   :454        NA's   :454         NA's   :363        
##  koi_prad_err1      koi_prad_err2          koi_teq        koi_insol       
##  Min.   :    0.00   Min.   :-77180.00   Min.   :   25   Min.   :       0  
##  1st Qu.:    0.23   1st Qu.:    -1.94   1st Qu.:  539   1st Qu.:      20  
##  Median :    0.52   Median :    -0.30   Median :  878   Median :     142  
##  Mean   :   17.66   Mean   :   -33.02   Mean   : 1085   Mean   :    7746  
##  3rd Qu.:    2.32   3rd Qu.:    -0.14   3rd Qu.: 1379   3rd Qu.:     870  
##  Max.   :21640.00   Max.   :     0.00   Max.   :14667   Max.   :10947555  
##  NA's   :363        NA's   :363         NA's   :363     NA's   :321       
##  koi_insol_err1    koi_insol_err2     koi_model_snr    koi_tce_plnt_num
##  Min.   :      0   Min.   :-5600031   Min.   :   0.0   Min.   :1.000   
##  1st Qu.:      9   1st Qu.:    -287   1st Qu.:  12.0   1st Qu.:1.000   
##  Median :     73   Median :     -40   Median :  23.0   Median :1.000   
##  Mean   :   3751   Mean   :   -4044   Mean   : 259.9   Mean   :1.244   
##  3rd Qu.:    519   3rd Qu.:      -5   3rd Qu.:  78.0   3rd Qu.:1.000   
##  Max.   :3617133   Max.   :       0   Max.   :9054.7   Max.   :8.000   
##  NA's   :321       NA's   :321        NA's   :363      NA's   :346     
##        koi_tce_delivname   koi_steff     koi_steff_err1  koi_steff_err2   
##  q1_q16_tce     : 796    Min.   : 2661   Min.   :  0.0   Min.   :-1762.0  
##  q1_q17_dr24_tce: 368    1st Qu.: 5310   1st Qu.:106.0   1st Qu.: -198.0  
##  q1_q17_dr25_tce:8054    Median : 5767   Median :157.0   Median : -160.0  
##  NA's           : 346    Mean   : 5707   Mean   :144.6   Mean   : -162.3  
##                          3rd Qu.: 6112   3rd Qu.:174.0   3rd Qu.: -114.0  
##                          Max.   :15896   Max.   :676.0   Max.   :    0.0  
##                          NA's   :363     NA's   :468     NA's   :483      
##    koi_slogg     koi_slogg_err1   koi_slogg_err2       koi_srad      
##  Min.   :0.047   Min.   :0.0000   Min.   :-1.2070   Min.   :  0.109  
##  1st Qu.:4.218   1st Qu.:0.0420   1st Qu.:-0.1960   1st Qu.:  0.829  
##  Median :4.438   Median :0.0700   Median :-0.1280   Median :  1.000  
##  Mean   :4.310   Mean   :0.1207   Mean   :-0.1432   Mean   :  1.729  
##  3rd Qu.:4.543   3rd Qu.:0.1490   3rd Qu.:-0.0880   3rd Qu.:  1.345  
##  Max.   :5.364   Max.   :1.4720   Max.   : 0.0000   Max.   :229.908  
##  NA's   :363     NA's   :468      NA's   :468       NA's   :363      
##  koi_srad_err1     koi_srad_err2             ra             dec       
##  Min.   : 0.0000   Min.   :-116.1370   Min.   :279.9   Min.   :36.58  
##  1st Qu.: 0.1290   1st Qu.:  -0.2500   1st Qu.:288.7   1st Qu.:40.78  
##  Median : 0.2510   Median :  -0.1110   Median :292.3   Median :43.68  
##  Mean   : 0.3623   Mean   :  -0.3948   Mean   :292.1   Mean   :43.81  
##  3rd Qu.: 0.3640   3rd Qu.:  -0.0690   3rd Qu.:295.9   3rd Qu.:46.71  
##  Max.   :33.0910   Max.   :   0.0000   Max.   :301.7   Max.   :52.34  
##  NA's   :468       NA's   :468                                        
##    koi_kepmag    
##  Min.   : 6.966  
##  1st Qu.:13.440  
##  Median :14.520  
##  Mean   :14.265  
##  3rd Qu.:15.322  
##  Max.   :20.003  
##  NA's   :1

Summary information on the structure of the dataframe:

str(kepler_df)
## 'data.frame':    9564 obs. of  46 variables:
##  $ kepoi_name       : Factor w/ 9564 levels "K00001.01","K00002.01",..: 1081 1082 1083 1084 1085 1086 1087 1088 108 1089 ...
##  $ kepler_name      : Factor w/ 2294 levels "Kepler-1 b","Kepler-10 b",..: 1036 1037 NA NA 1868 1040 1039 1038 NA 1042 ...
##  $ koi_disposition  : Factor w/ 3 levels "CANDIDATE","CONFIRMED",..: 2 2 3 3 2 2 2 2 3 2 ...
##  $ koi_pdisposition : Factor w/ 2 levels "CANDIDATE","FALSE POSITIVE": 1 1 2 2 1 1 1 1 2 1 ...
##  $ koi_score        : num  1 0.969 0 0 1 1 1 0.992 0 1 ...
##  $ koi_fpflag_nt    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ koi_fpflag_ss    : int  0 0 1 1 0 0 0 0 1 0 ...
##  $ koi_fpflag_co    : int  0 0 0 0 0 0 0 0 1 0 ...
##  $ koi_fpflag_ec    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ koi_period       : num  9.49 54.42 19.9 1.74 2.53 ...
##  $ koi_period_err1  : num  2.78e-05 2.48e-04 1.49e-05 2.63e-07 3.76e-06 ...
##  $ koi_period_err2  : num  -2.78e-05 -2.48e-04 -1.49e-05 -2.63e-07 -3.76e-06 ...
##  $ koi_time0bk      : num  171 163 176 170 172 ...
##  $ koi_time0bk_err1 : num  0.00216 0.00352 0.000581 0.000115 0.00113 0.00141 0.0019 0.00461 0.00253 0.000517 ...
##  $ koi_time0bk_err2 : num  -0.00216 -0.00352 -0.000581 -0.000115 -0.00113 -0.00141 -0.0019 -0.00461 -0.00253 -0.000517 ...
##  $ koi_impact       : num  0.146 0.586 0.969 1.276 0.701 ...
##  $ koi_impact_err1  : num  0.318 0.059 5.126 0.115 0.235 ...
##  $ koi_impact_err2  : num  -0.146 -0.443 -0.077 -0.092 -0.478 -0.428 -0.532 -0.523 -0.044 -0.052 ...
##  $ koi_duration     : num  2.96 4.51 1.78 2.41 1.65 ...
##  $ koi_duration_err1: num  0.0819 0.116 0.0341 0.00537 0.042 0.061 0.0673 0.165 0.136 0.0241 ...
##  $ koi_duration_err2: num  -0.0819 -0.116 -0.0341 -0.00537 -0.042 -0.061 -0.0673 -0.165 -0.136 -0.0241 ...
##  $ koi_depth        : num  616 875 10829 8079 603 ...
##  $ koi_depth_err1   : num  19.5 35.5 171 12.8 16.9 24.2 18.7 16.8 5.8 33.3 ...
##  $ koi_depth_err2   : num  -19.5 -35.5 -171 -12.8 -16.9 -24.2 -18.7 -16.8 -5.8 -33.3 ...
##  $ koi_prad         : num  2.26 2.83 14.6 33.46 2.75 ...
##  $ koi_prad_err1    : num  0.26 0.32 3.92 8.5 0.88 1.27 0.9 0.52 6.45 0.22 ...
##  $ koi_prad_err2    : num  -0.15 -0.19 -1.31 -2.83 -0.35 -0.42 -0.3 -0.17 -9.67 -0.49 ...
##  $ koi_teq          : num  793 443 638 1395 1406 ...
##  $ koi_insol        : num  93.59 9.11 39.3 891.96 926.16 ...
##  $ koi_insol_err1   : num  29.45 2.87 31.04 668.95 874.33 ...
##  $ koi_insol_err2   : num  -16.65 -1.62 -10.49 -230.35 -314.24 ...
##  $ koi_model_snr    : num  35.8 25.8 76.3 505.6 40.9 ...
##  $ koi_tce_plnt_num : int  1 2 1 1 1 1 2 3 1 1 ...
##  $ koi_tce_delivname: Factor w/ 3 levels "q1_q16_tce","q1_q17_dr24_tce",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ koi_steff        : num  5455 5455 5853 5805 6031 ...
##  $ koi_steff_err1   : num  81 81 158 157 169 189 189 189 111 75 ...
##  $ koi_steff_err2   : num  -81 -81 -176 -174 -211 -232 -232 -232 -124 -83 ...
##  $ koi_slogg        : num  4.47 4.47 4.54 4.56 4.44 ...
##  $ koi_slogg_err1   : num  0.064 0.064 0.044 0.053 0.07 0.054 0.054 0.054 0.182 0.083 ...
##  $ koi_slogg_err2   : num  -0.096 -0.096 -0.176 -0.168 -0.21 -0.229 -0.229 -0.229 -0.098 -0.028 ...
##  $ koi_srad         : num  0.927 0.927 0.868 0.791 1.046 ...
##  $ koi_srad_err1    : num  0.105 0.105 0.233 0.201 0.334 0.315 0.315 0.315 0.322 0.033 ...
##  $ koi_srad_err2    : num  -0.061 -0.061 -0.078 -0.067 -0.133 -0.105 -0.105 -0.105 -0.483 -0.072 ...
##  $ ra               : num  292 292 297 286 289 ...
##  $ dec              : num  48.1 48.1 48.1 48.3 48.2 ...
##  $ koi_kepmag       : num  15.3 15.3 15.4 15.6 15.5 ...

We see there are many numeric values, some of which pertain to esoteric astronomical measures. Part of the ensuing investigation into the data will be to test our understanding of what certain features mean.

# First extract only those rows where the planet is named.
named_planets_df <- subset(kepler_df, !is.na(kepler_name))
missmap(named_planets_df)

summary(named_planets_df$koi_teq)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   129.0   554.0   781.0   839.1  1039.0  3559.0       1
summary(named_planets_df$koi_prad)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.270   1.530   2.170   2.879   2.940  77.760       1

EDA Question #1: Are binary stars more likely to host planets?

There is a “false positive flag” associated with the likely presence of a binary star (i.e. it is set to ‘1’ if the observed light curve is likely due to a binary star). We want to use these flags to determine what proportion of candidate planets are found around (probable) binary stars. We also want to compare what the literature says regarding the planetary hosting ability of binary stars to what the Kepler analysis suggests.

# Let's create labels for the binary star false positive flag
kepler_df$koi_fpflag_ss <- ifelse(kepler_df$koi_fpflag_ss == 0,
                                  "No binary star detected",
                                  "Probable binary star")

# Plot the dispositions according to Kepler data analysis
plot1 <- ggplot(kepler_df, aes(x = koi_fpflag_ss, fill = koi_pdisposition)) + geom_bar(position = "fill")
plot2 <- ggplot(kepler_df, aes(x = koi_fpflag_ss, fill = koi_disposition)) + geom_bar(position = "fill")
plot1

plot2

The above plots show that binary stars have a much smaller proportion of likely planets encircling them than do single stars, and this is seen in both the Kepler analysis labels and literature labels. But this may not reflect the actual capability of binary stars to host planets. These plots could just be a reflection of how difficult it is to detect planets encircling binary star systems.
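The proportions that the stacked bars display can also be tabulated directly. The following is a minimal sketch on a made-up toy data frame (`toy_df` and its values are illustrative assumptions, not rows from the Kepler data); the same two calls could be applied to `kepler_df`.

```r
# Toy stand-in for kepler_df; the values are made up for illustration only.
toy_df <- data.frame(
  koi_fpflag_ss = c("No binary star detected", "No binary star detected",
                    "No binary star detected", "Probable binary star",
                    "Probable binary star", "Probable binary star"),
  koi_pdisposition = c("CANDIDATE", "CANDIDATE", "FALSE POSITIVE",
                       "FALSE POSITIVE", "FALSE POSITIVE", "CANDIDATE")
)

# Cross-tabulate the flag against the disposition,
# then normalise each row so that it sums to 1.
counts <- table(toy_df$koi_fpflag_ss, toy_df$koi_pdisposition)
row_props <- prop.table(counts, margin = 1)
print(round(row_props, 3))
```

Each row of `row_props` corresponds to one bar in a `position = "fill"` plot, so the table gives the exact heights of the coloured segments.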

EDA Question #2: What are the feature distributions of likely habitable planets?

Using the data available, we can filter planets according to their likeness to Earth and treat these planets as likely being “habitable”. We should note, however, that the filtering is very crude. The only two criteria we can rely on to filter planets are effective temperature (koi_teq) and radius (koi_prad). These two criteria alone are not enough to determine the habitability of a planet. Other data, regarding planet composition for example, are needed to make a definitive judgement on habitability. But we’ll proceed with the crude method for the purpose of this analysis.

We’ll take the habitable planets to be approximately Earth-sized and within a temperature range that supports liquid water on the surface. Also, we’ll only look at planets with decent koi_score values (where koi_score is a measure of how certain scientists are that the corresponding observation is a planet).

habit_df <- subset(kepler_df, koi_prad >= 0.5 & koi_prad <= 2.0 & koi_teq >= 273 & koi_teq <= 373 & koi_pdisposition == "CANDIDATE" & koi_score >= 0.4)
str(habit_df)
## 'data.frame':    52 obs. of  46 variables:
##  $ kepoi_name       : Factor w/ 9564 levels "K00001.01","K00002.01",..: 1112 1160 1161 1167 1271 1291 1298 1419 1536 1369 ...
##  $ kepler_name      : Factor w/ 2294 levels "Kepler-1 b","Kepler-10 b",..: 1674 1059 1060 1062 1703 1100 1716 1137 1154 1951 ...
##  $ koi_disposition  : Factor w/ 3 levels "CANDIDATE","CONFIRMED",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ koi_pdisposition : Factor w/ 2 levels "CANDIDATE","FALSE POSITIVE": 1 1 1 1 1 1 1 1 1 1 ...
##  $ koi_score        : num  1 1 1 1 0.986 0.998 1 0.881 0.992 1 ...
##  $ koi_fpflag_nt    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ koi_fpflag_ss    : chr  "No binary star detected" "No binary star detected" "No binary star detected" "No binary star detected" ...
##  $ koi_fpflag_co    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ koi_fpflag_ec    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ koi_period       : num  36.4 20.1 46.2 24 21 ...
##  $ koi_period_err1  : num  1.81e-04 5.84e-05 2.65e-04 9.25e-05 7.66e-05 ...
##  $ koi_period_err2  : num  -1.81e-04 -5.84e-05 -2.65e-04 -9.25e-05 -7.66e-05 ...
##  $ koi_time0bk      : num  152 147 165 186 152 ...
##  $ koi_time0bk_err1 : num  0.00392 0.00241 0.0043 0.00305 0.00305 0.00229 0.00485 0.00932 0.00254 0.00157 ...
##  $ koi_time0bk_err2 : num  -0.00392 -0.00241 -0.0043 -0.00305 -0.00305 -0.00229 -0.00485 -0.00932 -0.00254 -0.00157 ...
##  $ koi_impact       : num  0.028 0.556 0.013 0.416 0.228 0.115 0.045 0.015 0.035 0.009 ...
##  $ koi_impact_err1  : num  0.437 0.323 0.415 0.053 0.176 0.32 0.393 0.464 0.395 0.374 ...
##  $ koi_impact_err2  : num  -0.028 -0.368 -0.013 -0.416 -0.228 -0.115 -0.045 -0.015 -0.035 -0.009 ...
##  $ koi_duration     : num  4.01 3.32 4.76 3.83 2.63 ...
##  $ koi_duration_err1: num  0.126 0.0754 0.13 0.113 0.0866 0.069 0.155 0.331 0.0806 0.0556 ...
##  $ koi_duration_err2: num  -0.126 -0.0754 -0.13 -0.113 -0.0866 -0.069 -0.155 -0.331 -0.0806 -0.0556 ...
##  $ koi_depth        : num  1122 1495 1395 1182 767 ...
##  $ koi_depth_err1   : num  53.4 45.1 56.4 46.1 42.9 26.4 43.6 27.7 60.3 33.9 ...
##  $ koi_depth_err2   : num  -53.4 -45.1 -56.4 -46.1 -42.9 -26.4 -43.6 -27.7 -60.3 -33.9 ...
##  $ koi_prad         : num  1.99 1.96 1.83 1.8 1.3 1.11 1.85 1.75 1.87 1.83 ...
##  $ koi_prad_err1    : num  0.09 0.13 0.12 0.1 0.1 0.12 0.13 0.11 0.14 0.16 ...
##  $ koi_prad_err2    : num  -0.09 -0.16 -0.15 -0.15 -0.15 -0.16 -0.05 -0.13 -0.22 -0.21 ...
##  $ koi_teq          : num  332 361 273 329 328 332 349 372 301 298 ...
##  $ koi_insol        : num  2.88 4 1.32 2.77 2.74 2.86 3.49 4.53 1.95 1.87 ...
##  $ koi_insol_err1   : num  0.51 0.89 0.29 0.58 0.7 0.95 0.81 1.04 0.49 0.52 ...
##  $ koi_insol_err2   : num  -0.47 -0.9 -0.3 -0.64 -0.79 -0.98 -0.47 -0.95 -0.55 -0.53 ...
##  $ koi_model_snr    : num  21.7 35.2 26 27.5 18.2 30.9 21.6 17.4 28.9 53.8 ...
##  $ koi_tce_plnt_num : int  3 2 3 1 3 3 5 2 3 1 ...
##  $ koi_tce_delivname: Factor w/ 3 levels "q1_q16_tce","q1_q17_dr24_tce",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ koi_steff        : num  4126 3950 3950 3747 3713 ...
##  $ koi_steff_err1   : num  82 70 70 75 74 71 90 105 75 75 ...
##  $ koi_steff_err2   : num  -82 -86 -86 -83 -92 -89 -90 -105 -82 -84 ...
##  $ koi_slogg        : num  4.66 4.75 4.75 4.73 4.78 ...
##  $ koi_slogg_err1   : num  0.022 0.042 0.042 0.042 0.063 0.055 0.013 0.072 0.063 0.063 ...
##  $ koi_slogg_err2   : num  -0.022 -0.031 -0.031 -0.025 -0.031 -0.055 -0.043 -0.048 -0.031 -0.031 ...
##  $ koi_srad         : num  0.615 0.493 0.493 0.524 0.47 0.411 0.646 0.849 0.46 0.461 ...
##  $ koi_srad_err1    : num  0.027 0.033 0.033 0.03 0.036 0.045 0.043 0.056 0.035 0.04 ...
##  $ koi_srad_err2    : num  -0.029 -0.04 -0.04 -0.044 -0.054 -0.06 -0.02 -0.063 -0.053 -0.053 ...
##  $ ra               : num  287 286 286 284 295 ...
##  $ dec              : num  50 39.3 39.3 39.9 43.1 ...
##  $ koi_kepmag       : num  15.1 16 16 15.4 15.8 ...
summary(habit_df)
##      kepoi_name        kepler_name       koi_disposition
##  K00172.02: 1   Kepler-1185 b: 1   CANDIDATE     :23    
##  K00238.03: 1   Kepler-138 d : 1   CONFIRMED     :29    
##  K00248.04: 1   Kepler-1450 b: 1   FALSE POSITIVE: 0    
##  K00253.02: 1   Kepler-1459 b: 1                        
##  K00314.02: 1   Kepler-1512 b: 1                        
##  K00494.01: 1   (Other)      :24                        
##  (Other)  :46   NA's         :23                        
##        koi_pdisposition   koi_score      koi_fpflag_nt koi_fpflag_ss     
##  CANDIDATE     :52      Min.   :0.5230   Min.   :0     Length:52         
##  FALSE POSITIVE: 0      1st Qu.:0.9223   1st Qu.:0     Class :character  
##                         Median :0.9920   Median :0     Mode  :character  
##                         Mean   :0.9308   Mean   :0                       
##                         3rd Qu.:1.0000   3rd Qu.:0                       
##                         Max.   :1.0000   Max.   :0                       
##                                                                          
##  koi_fpflag_co koi_fpflag_ec   koi_period      koi_period_err1    
##  Min.   :0     Min.   :0     Min.   :  4.486   Min.   :0.0000077  
##  1st Qu.:0     1st Qu.:0     1st Qu.: 20.901   1st Qu.:0.0000753  
##  Median :0     Median :0     Median : 37.498   Median :0.0002705  
##  Mean   :0     Mean   :0     Mean   : 63.967   Mean   :0.0010320  
##  3rd Qu.:0     3rd Qu.:0     3rd Qu.: 77.912   3rd Qu.:0.0009774  
##  Max.   :0     Max.   :0     Max.   :362.978   Max.   :0.0159900  
##                                                NA's   :1          
##  koi_period_err2       koi_time0bk    koi_time0bk_err1  
##  Min.   :-0.0159900   Min.   :131.1   Min.   :0.000811  
##  1st Qu.:-0.0009774   1st Qu.:139.9   1st Qu.:0.002710  
##  Median :-0.0002705   Median :151.2   Median :0.005190  
##  Mean   :-0.0010320   Mean   :160.7   Mean   :0.008115  
##  3rd Qu.:-0.0000753   3rd Qu.:166.1   3rd Qu.:0.009940  
##  Max.   :-0.0000077   Max.   :280.1   Max.   :0.051800  
##  NA's   :1                            NA's   :1         
##  koi_time0bk_err2      koi_impact      koi_impact_err1  koi_impact_err2  
##  Min.   :-0.051800   Min.   :0.00400   Min.   :0.0000   Min.   :-0.7030  
##  1st Qu.:-0.009940   1st Qu.:0.04575   1st Qu.:0.0740   1st Qu.:-0.4360  
##  Median :-0.005190   Median :0.22250   Median :0.3140   Median :-0.2170  
##  Mean   :-0.008115   Mean   :0.32267   Mean   :0.2604   Mean   :-0.2559  
##  3rd Qu.:-0.002710   3rd Qu.:0.53400   3rd Qu.:0.4120   3rd Qu.:-0.0455  
##  Max.   :-0.000811   Max.   :0.95400   Max.   :0.5350   Max.   :-0.0040  
##  NA's   :1                             NA's   :1        NA's   :1        
##   koi_duration     koi_duration_err1 koi_duration_err2   koi_depth     
##  Min.   : 0.8161   Min.   :0.0301    Min.   :-1.4000   Min.   : 243.8  
##  1st Qu.: 2.4299   1st Qu.:0.0836    1st Qu.:-0.3035   1st Qu.: 374.8  
##  Median : 3.7890   Median :0.1530    Median :-0.1530   Median : 606.9  
##  Mean   : 4.4728   Mean   :0.2433    Mean   :-0.2433   Mean   : 843.8  
##  3rd Qu.: 5.0238   3rd Qu.:0.3035    3rd Qu.:-0.0836   3rd Qu.: 922.5  
##  Max.   :16.0300   Max.   :1.4000    Max.   :-0.0301   Max.   :6462.0  
##                    NA's   :1         NA's   :1                         
##  koi_depth_err1   koi_depth_err2       koi_prad     koi_prad_err1   
##  Min.   :  8.90   Min.   :-203.00   Min.   :0.790   Min.   :0.0500  
##  1st Qu.: 26.95   1st Qu.: -54.35   1st Qu.:1.270   1st Qu.:0.0975  
##  Median : 37.40   Median : -37.40   Median :1.565   Median :0.1200  
##  Mean   : 47.65   Mean   : -47.65   Mean   :1.529   Mean   :0.1542  
##  3rd Qu.: 54.35   3rd Qu.: -26.95   3rd Qu.:1.830   3rd Qu.:0.1650  
##  Max.   :203.00   Max.   :  -8.90   Max.   :1.990   Max.   :0.7000  
##  NA's   :1        NA's   :1                                         
##  koi_prad_err2        koi_teq        koi_insol     koi_insol_err1  
##  Min.   :-0.3300   Min.   :273.0   Min.   :1.320   Min.   :0.2900  
##  1st Qu.:-0.1725   1st Qu.:304.5   1st Qu.:2.045   1st Qu.:0.5275  
##  Median :-0.1450   Median :329.5   Median :2.780   Median :0.7000  
##  Mean   :-0.1458   Mean   :327.0   Mean   :2.812   Mean   :0.9652  
##  3rd Qu.:-0.1175   3rd Qu.:349.0   3rd Qu.:3.495   3rd Qu.:1.0500  
##  Max.   :-0.0500   Max.   :373.0   Max.   :4.590   Max.   :4.1600  
##                                                                    
##  koi_insol_err2    koi_model_snr   koi_tce_plnt_num
##  Min.   :-1.6700   Min.   : 5.10   Min.   :1.000   
##  1st Qu.:-0.9600   1st Qu.:12.97   1st Qu.:1.000   
##  Median :-0.6300   Median :16.55   Median :1.000   
##  Mean   :-0.7208   Mean   :21.42   Mean   :1.635   
##  3rd Qu.:-0.4775   3rd Qu.:27.20   3rd Qu.:2.000   
##  Max.   :-0.3000   Max.   :57.40   Max.   :5.000   
##                                                    
##        koi_tce_delivname   koi_steff    koi_steff_err1   koi_steff_err2  
##  q1_q16_tce     : 0      Min.   :3157   Min.   : 41.00   Min.   :-219.0  
##  q1_q17_dr24_tce: 0      1st Qu.:3750   1st Qu.: 74.00   1st Qu.:-129.2  
##  q1_q17_dr25_tce:52      Median :4129   Median : 82.50   Median : -88.5  
##                          Mean   :4367   Mean   : 97.94   Mean   :-103.9  
##                          3rd Qu.:4899   3rd Qu.:115.50   3rd Qu.: -82.0  
##                          Max.   :6086   Max.   :219.00   Max.   : -25.0  
##                                                                          
##    koi_slogg     koi_slogg_err1    koi_slogg_err2        koi_srad     
##  Min.   :4.274   Min.   :0.01000   Min.   :-0.20400   Min.   :0.1800  
##  1st Qu.:4.560   1st Qu.:0.03600   1st Qu.:-0.05525   1st Qu.:0.4753  
##  Median :4.691   Median :0.05300   Median :-0.03550   Median :0.5580  
##  Mean   :4.684   Mean   :0.05679   Mean   :-0.05183   Mean   :0.6104  
##  3rd Qu.:4.776   3rd Qu.:0.06750   3rd Qu.:-0.03075   3rd Qu.:0.7750  
##  Max.   :5.112   Max.   :0.13700   Max.   :-0.01000   Max.   :1.2160  
##                                                                       
##  koi_srad_err1     koi_srad_err2            ra             dec       
##  Min.   :0.02200   Min.   :-0.17900   Min.   :281.6   Min.   :37.36  
##  1st Qu.:0.03300   1st Qu.:-0.06500   1st Qu.:286.7   1st Qu.:41.02  
##  Median :0.04350   Median :-0.05000   Median :290.8   Median :43.99  
##  Mean   :0.06133   Mean   :-0.05729   Mean   :290.7   Mean   :43.76  
##  3rd Qu.:0.05875   3rd Qu.:-0.03875   3rd Qu.:294.6   3rd Qu.:46.13  
##  Max.   :0.23800   Max.   :-0.02000   Max.   :299.8   Max.   :50.70  
##                                                                      
##    koi_kepmag   
##  Min.   :12.57  
##  1st Qu.:14.40  
##  Median :15.21  
##  Mean   :15.00  
##  3rd Qu.:15.73  
##  Max.   :17.48  
## 
head(habit_df)
##     kepoi_name  kepler_name koi_disposition koi_pdisposition koi_score
## 57   K00775.03  Kepler-52 d       CONFIRMED        CANDIDATE     1.000
## 86   K00812.02 Kepler-235 d       CONFIRMED        CANDIDATE     1.000
## 87   K00812.03 Kepler-235 e       CONFIRMED        CANDIDATE     1.000
## 115  K00817.01 Kepler-236 c       CONFIRMED        CANDIDATE     1.000
## 223  K00886.03  Kepler-54 d       CONFIRMED        CANDIDATE     0.986
## 246  K00899.03 Kepler-249 d       CONFIRMED        CANDIDATE     0.998
##     koi_fpflag_nt           koi_fpflag_ss koi_fpflag_co koi_fpflag_ec
## 57              0 No binary star detected             0             0
## 86              0 No binary star detected             0             0
## 87              0 No binary star detected             0             0
## 115             0 No binary star detected             0             0
## 223             0 No binary star detected             0             0
## 246             0 No binary star detected             0             0
##     koi_period koi_period_err1 koi_period_err2 koi_time0bk
## 57    36.44540       1.809e-04      -1.809e-04    151.6012
## 86    20.06036       5.839e-05      -5.839e-05    147.4655
## 87    46.18420       2.654e-04      -2.654e-04    165.2373
## 115   23.96794       9.249e-05      -9.249e-05    186.2218
## 223   20.99588       7.662e-05      -7.662e-05    152.3155
## 246   15.36846       4.067e-05      -4.067e-05    147.3896
##     koi_time0bk_err1 koi_time0bk_err2 koi_impact koi_impact_err1
## 57           0.00392         -0.00392      0.028           0.437
## 86           0.00241         -0.00241      0.556           0.323
## 87           0.00430         -0.00430      0.013           0.415
## 115          0.00305         -0.00305      0.416           0.053
## 223          0.00305         -0.00305      0.228           0.176
## 246          0.00229         -0.00229      0.115           0.320
##     koi_impact_err2 koi_duration koi_duration_err1 koi_duration_err2
## 57           -0.028       4.0070            0.1260           -0.1260
## 86           -0.368       3.3203            0.0754           -0.0754
## 87           -0.013       4.7580            0.1300           -0.1300
## 115          -0.416       3.8270            0.1130           -0.1130
## 223          -0.228       2.6333            0.0866           -0.0866
## 246          -0.115       2.4714            0.0690           -0.0690
##     koi_depth koi_depth_err1 koi_depth_err2 koi_prad koi_prad_err1
## 57     1122.3           53.4          -53.4     1.99          0.09
## 86     1494.7           45.1          -45.1     1.96          0.13
## 87     1394.7           56.4          -56.4     1.83          0.12
## 115    1182.3           46.1          -46.1     1.80          0.10
## 223     767.3           42.9          -42.9     1.30          0.10
## 246     756.1           26.4          -26.4     1.11          0.12
##     koi_prad_err2 koi_teq koi_insol koi_insol_err1 koi_insol_err2
## 57          -0.09     332      2.88           0.51          -0.47
## 86          -0.16     361      4.00           0.89          -0.90
## 87          -0.15     273      1.32           0.29          -0.30
## 115         -0.15     329      2.77           0.58          -0.64
## 223         -0.15     328      2.74           0.70          -0.79
## 246         -0.16     332      2.86           0.95          -0.98
##     koi_model_snr koi_tce_plnt_num koi_tce_delivname koi_steff
## 57           21.7                3   q1_q17_dr25_tce      4126
## 86           35.2                2   q1_q17_dr25_tce      3950
## 87           26.0                3   q1_q17_dr25_tce      3950
## 115          27.5                1   q1_q17_dr25_tce      3747
## 223          18.2                3   q1_q17_dr25_tce      3713
## 246          30.9                3   q1_q17_dr25_tce      3561
##     koi_steff_err1 koi_steff_err2 koi_slogg koi_slogg_err1 koi_slogg_err2
## 57              82            -82     4.661          0.022         -0.022
## 86              70            -86     4.754          0.042         -0.031
## 87              70            -86     4.754          0.042         -0.031
## 115             75            -83     4.728          0.042         -0.025
## 223             74            -92     4.779          0.063         -0.031
## 246             71            -89     4.855          0.055         -0.055
##     koi_srad koi_srad_err1 koi_srad_err2       ra      dec koi_kepmag
## 57     0.615         0.027        -0.029 286.7380 49.97575     15.095
## 86     0.493         0.033        -0.040 286.0791 39.27832     15.954
## 87     0.493         0.033        -0.040 286.0791 39.27832     15.954
## 115    0.524         0.030        -0.044 283.8664 39.89808     15.414
## 223    0.470         0.036        -0.054 294.7739 43.05630     15.847
## 246    0.411         0.045        -0.060 296.9851 43.65852     15.234

Now let’s visualize the features of the “habitable” planets.

# Temperature distribution
ggplot(habit_df, aes(x = koi_teq, fill = koi_disposition)) + geom_histogram(binwidth = 9) + xlab("Effective Temperature (Kelvin)") + labs(title = "Temperature Distribution of Likely Habitable Planets\n") + geom_vline(xintercept=252, colour="orange", linetype = "longdash") + annotate("text", x = 267, y = 6, label = "Effective Temp\nof Earth")

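Observation: all the Earth-like planets in the dataset are warmer than the Earth. This is partly by construction: the filter keeps only koi_teq values of 273 K and above, while the Earth's reference line sits at 252 K.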

# Radius distribution
ggplot(habit_df, aes(x = koi_prad, fill = koi_disposition)) + geom_histogram(binwidth = 0.1) + xlab("Planetary Radius (Earth Radii)") + labs(title = "Planetary Radius Distribution of Likely Habitable Planets\n") + geom_vline(xintercept=1, colour="orange", linetype = "longdash") + annotate("text", x = 1.14, y = 6, label = "Earth Radius")

Observation: most of the Earth-like planets in the dataset are substantially larger than the Earth.

Let’s investigate sky-projected distances.

# koi_impact distribution
ggplot(habit_df, aes(x = koi_impact, fill = koi_disposition)) + geom_histogram(binwidth = 0.1) + xlab("Sky-Projected Distance") + labs(title = "Sky-Projected Distance Distribution of Likely Habitable Planets\n")

Observation: The distribution looked negative exponential, and suggested that Earth-like planets tended to have smaller sky-projected distances. We were not certain whether the negative exponential shape lent itself to any special interpretation, or whether koi_impact measures related to any sort of Poisson point process. Such an interpretation was especially difficult to make given that we did not really know what sky-projected distance represented.

We did suspect, however, that sky-projected distance was a proxy for the actual distance between a planet and its star. This suspicion arose out of the fact that the “goldilocks” zone for a planet tended to be closer to the star. Therefore we expected habitable planets to be located somewhat closer to stars. A continuation of the investigation into sky-projected distance is detailed below…

But first we finish detailing our analysis of likely habitable planets. Here is a scatterplot of the Earth-like planets.

ggplot(habit_df, aes(x = koi_prad, y = koi_teq, size = koi_score)) + geom_point(aes(color = koi_disposition)) + labs(title = "Plot of Likely Habitable Planets\n") + xlab("Planetary Radius (Earth Radii)") + ylab("Effective Temperature (Kelvin)")

Observation: There’s no information here that was not revealed above. The likely habitable planets in the dataset were typically larger and warmer than the Earth.

EDA Question #3: What does sky-projected distance represent?

Now let’s proceed to better understand sky-projected distance. We had a hypothesis that sky-projected distance was a proxy for actual distance from a star. Luckily, we had planet features that related to the orbital speeds of planets. The laws of physics dictate that planets further out from a star are generally slower moving. So, by plotting sky-projected distance in relation to transit duration, or period between transits, we could see if sky-projected distance increased with larger transit duration.

But first, we examine the distribution of sky-projected distances for all planets.

# Let's take a dataset where we have sky-projected distance data, where the
# koi_score is fairly high, and where there is no FALSE POSITIVE label.
# The motivation for this is to remove any erroneous data associated with
# observations that are not likely to be planets.
test_df <- subset(kepler_df, !is.na(koi_impact) & koi_score > 0.5 & koi_disposition != "FALSE POSITIVE" & koi_impact <= 1.0)
ggplot(test_df, aes(x = koi_impact, fill = koi_disposition)) + geom_histogram(binwidth = 0.01) + xlab("Sky-Projected Distance") + labs(title = "Sky-Projected Distance Distribution of Likely Planets\n") + xlim(c(0,1))

summary(test_df$koi_impact)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.1030  0.3935  0.4343  0.7480  1.0000
print(mean(test_df$koi_impact))
## [1] 0.434295
print(sd(test_df$koi_impact))
## [1] 0.3308789

As with the likely habitable planets, we see a negative exponential-looking distribution for sky-projected distance. This undermines the hypothesis mentioned above: it’s not just habitable planets that tend to have smaller koi_impact. Planets, in general, are likely to have small koi_impact values. This puts koi_impact into doubt as a predictor/proxy for the actual distance of a planet from a star.
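The physical reasoning used in this section (planets farther from their star move more slowly and take longer to complete an orbit) is Kepler's third law. As a rough sketch, an orbital distance can be estimated from koi_period alone, which gives a quantity to compare koi_impact against. The function name `semi_major_axis_au` and the `stellar_mass_solar` argument are hypothetical; the data frame used here has no stellar mass column, so a mass of one solar mass is assumed by default, along with a planet mass that is negligible next to the star's.

```r
# Kepler's third law in solar units: a^3 = M * P^2,
# with a in AU, P in years, and M in solar masses.
# Assumption: planet mass is negligible compared with the stellar mass.
semi_major_axis_au <- function(period_days, stellar_mass_solar = 1) {
  period_years <- period_days / 365.25
  (stellar_mass_solar * period_years^2)^(1 / 3)
}

# A one-year orbit around a one-solar-mass star sits at 1 AU.
semi_major_axis_au(365.25)

# The function is vectorised, so it can take a whole column of periods.
semi_major_axis_au(c(9.49, 54.42, 362.978))
```

Plotting `semi_major_axis_au(test_df$koi_period)` against `test_df$koi_impact` would be one way to check whether the two quantities move together.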

EDA Question #4: What do the stars of Earth-like planets look like, and how do they compare to our sun?

Then we turn our attention to the stars that host Earth-like planets. How did they compare to our Sun?

We plot the star radii v. their photospheric temperature.

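temp_df <- habit_df[,c("koi_steff","koi_slogg","koi_srad")]
# Sun: photospheric temperature (K), log10 of surface gravity, radius (solar radii).
# koi_slogg is given in log10(cm/s^2), so the Sun's value is about 4.438.
sun <- c(5778, 4.43775056282, 1.00)
temp_df[dim(temp_df)[1]+1,] <- sun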


ggplot(temp_df, aes(x = koi_srad, y = koi_steff)) + geom_point(aes(colour = koi_slogg), size = 5) + scale_colour_gradient2(low = "#FF3300", mid = "white", high = "#663300", midpoint = 3.6) + xlab("Photospheric Radius of the Star (Normalized to Sun's Radius)") + ylab("Photospheric Temperature of the Star (Kelvin)") + labs(title = "Stars with Potentially Habitable Planets\n", color = "Base-10 Log\nof Surface Gravity\nAcceleration") + annotate("text", x = 1.06, y = 5778, label = "Sun") + geom_smooth(method='lm',formula=y~x) + annotate("text", x = 0.5, y = 6000, label = paste("R-square =", round(cor(temp_df$koi_srad,temp_df$koi_steff)^2, digits = 4)))

# Print the correlation
print(cor(temp_df$koi_srad,temp_df$koi_steff))
## [1] 0.9548735

Observation: Most of the stars that host Earth-like planets seemed to be smaller, cooler, and have larger surface gravity when compared to our Sun. What was also interesting to see was the fairly high degree of correlation observed between photospheric radius and temperature of stars hosting Earth-like planets. The correlation table above, built for all candidate planets with a fairly high koi_score, did not suggest this would be the case. But when we look at the subset of stars that host Earth-like planets, the relation is apparent. It’s not clear why this is the case, however.

EDA Question #5: Do the Earth-like planets congregate within certain patches of the night sky?

A Kaggle user used right ascension and declination data, the celestial coordinates of the observations, to see where candidates and confirmed planets were being observed. We wanted to do something similar, but use the resulting plot in a different way: to see if Earth-like planets were restricted to certain patches of the night sky.

Add labels to the overall dataset: “Earth-like” and “Not Earth-like”.

for (i in 1:dim(kepler_df)[1]) {
  # Do a check for NAs
  na_check <- is.na(kepler_df$koi_prad[i]) | is.na(kepler_df$koi_teq[i]) | is.na(kepler_df$koi_score[i]) | is.na(kepler_df$koi_pdisposition[i])
  if (!na_check) {
    if (kepler_df$koi_prad[i] >= 0.5 & kepler_df$koi_prad[i] <= 2.0 & kepler_df$koi_teq[i] >= 273 & kepler_df$koi_teq[i] <= 373 & kepler_df$koi_pdisposition[i] == "CANDIDATE" & kepler_df$koi_score[i] >= 0.4) {
      kepler_df$koi_els[i] <- "Earth-like"
    } else {
      kepler_df$koi_els[i] <- "Not Earth-like"
    }
  }
}
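The same labelling can be done without an explicit loop. The following is a minimal vectorised sketch on a made-up toy data frame (`toy_df` and its values are illustrative assumptions); it is meant to mirror the loop's behaviour, including leaving the label as NA for rows with missing inputs.

```r
# Toy stand-in for kepler_df; the values are made up for illustration only.
toy_df <- data.frame(
  koi_prad = c(1.2, 5.0, NA, 1.8),
  koi_teq = c(300, 300, 310, 500),
  koi_score = c(0.9, 0.9, 0.9, 0.9),
  koi_pdisposition = c("CANDIDATE", "CANDIDATE", "CANDIDATE", "CANDIDATE")
)

# Rows where every input needed by the filter is present.
complete <- with(toy_df, !is.na(koi_prad) & !is.na(koi_teq) &
                         !is.na(koi_score) & !is.na(koi_pdisposition))

# Same thresholds as the loop, evaluated for all rows at once.
earth_like <- with(toy_df, koi_prad >= 0.5 & koi_prad <= 2.0 &
                           koi_teq >= 273 & koi_teq <= 373 &
                           koi_pdisposition == "CANDIDATE" & koi_score >= 0.4)

# Rows with missing inputs stay NA, as in the loop.
toy_df$koi_els <- ifelse(complete,
                         ifelse(earth_like, "Earth-like", "Not Earth-like"),
                         NA_character_)
print(toy_df$koi_els)
```

The explicit `complete` mask matters because a row can fail one threshold while another input is missing; without the mask such a row would be labelled "Not Earth-like" rather than left as NA.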

Now we plot celestial coordinate data overlaid with the new koi_els feature information.

el_df <- subset(kepler_df, koi_els == "Earth-like")

ggplot(kepler_df, aes(x = ra, y = dec)) + geom_point(aes(colour = koi_els), size = 1.5) + xlab("Right Ascension") + ylab("Declination") + labs(title = "Celestial Positioning of Observations\n") 

As suspected, the observations corresponding to Earth-like planets are spread out across the patches of celestial coordinates observed by Kepler. Another hypothesis was that perhaps the patches with a higher density of observations would also have more observations of Earth-like planets. This does not seem to be the case, however.

Step 3: Data Exploration

Cleaning Data

starData <- read.csv("cumulative.csv", header = TRUE)
head(starData)
##   rowid    kepid kepoi_name  kepler_name koi_disposition koi_pdisposition
## 1     1 10797460  K00752.01 Kepler-227 b       CONFIRMED        CANDIDATE
## 2     2 10797460  K00752.02 Kepler-227 c       CONFIRMED        CANDIDATE
## 3     3 10811496  K00753.01               FALSE POSITIVE   FALSE POSITIVE
## 4     4 10848459  K00754.01               FALSE POSITIVE   FALSE POSITIVE
## 5     5 10854555  K00755.01 Kepler-664 b       CONFIRMED        CANDIDATE
## 6     6 10872983  K00756.01 Kepler-228 d       CONFIRMED        CANDIDATE
##   koi_score koi_fpflag_nt koi_fpflag_ss koi_fpflag_co koi_fpflag_ec
## 1     1.000             0             0             0             0
## 2     0.969             0             0             0             0
## 3     0.000             0             1             0             0
## 4     0.000             0             1             0             0
## 5     1.000             0             0             0             0
## 6     1.000             0             0             0             0
##   koi_period koi_period_err1 koi_period_err2 koi_time0bk koi_time0bk_err1
## 1   9.488036       2.775e-05      -2.775e-05    170.5387         0.002160
## 2  54.418383       2.479e-04      -2.479e-04    162.5138         0.003520
## 3  19.899140       1.494e-05      -1.494e-05    175.8503         0.000581
## 4   1.736952       2.630e-07      -2.630e-07    170.3076         0.000115
## 5   2.525592       3.761e-06      -3.761e-06    171.5956         0.001130
## 6  11.094321       2.036e-05      -2.036e-05    171.2012         0.001410
##   koi_time0bk_err2 koi_impact koi_impact_err1 koi_impact_err2 koi_duration
## 1        -0.002160      0.146           0.318          -0.146      2.95750
## 2        -0.003520      0.586           0.059          -0.443      4.50700
## 3        -0.000581      0.969           5.126          -0.077      1.78220
## 4        -0.000115      1.276           0.115          -0.092      2.40641
## 5        -0.001130      0.701           0.235          -0.478      1.65450
## 6        -0.001410      0.538           0.030          -0.428      4.59450
##   koi_duration_err1 koi_duration_err2 koi_depth koi_depth_err1
## 1           0.08190          -0.08190     615.8           19.5
## 2           0.11600          -0.11600     874.8           35.5
## 3           0.03410          -0.03410   10829.0          171.0
## 4           0.00537          -0.00537    8079.2           12.8
## 5           0.04200          -0.04200     603.3           16.9
## 6           0.06100          -0.06100    1517.5           24.2
##   koi_depth_err2 koi_prad koi_prad_err1 koi_prad_err2 koi_teq koi_teq_err1
## 1          -19.5     2.26          0.26         -0.15     793           NA
## 2          -35.5     2.83          0.32         -0.19     443           NA
## 3         -171.0    14.60          3.92         -1.31     638           NA
## 4          -12.8    33.46          8.50         -2.83    1395           NA
## 5          -16.9     2.75          0.88         -0.35    1406           NA
## 6          -24.2     3.90          1.27         -0.42     835           NA
##   koi_teq_err2 koi_insol koi_insol_err1 koi_insol_err2 koi_model_snr
## 1           NA     93.59          29.45         -16.65          35.8
## 2           NA      9.11           2.87          -1.62          25.8
## 3           NA     39.30          31.04         -10.49          76.3
## 4           NA    891.96         668.95        -230.35         505.6
## 5           NA    926.16         874.33        -314.24          40.9
## 6           NA    114.81         112.85         -36.70          66.5
##   koi_tce_plnt_num koi_tce_delivname koi_steff koi_steff_err1
## 1                1   q1_q17_dr25_tce      5455             81
## 2                2   q1_q17_dr25_tce      5455             81
## 3                1   q1_q17_dr25_tce      5853            158
## 4                1   q1_q17_dr25_tce      5805            157
## 5                1   q1_q17_dr25_tce      6031            169
## 6                1   q1_q17_dr25_tce      6046            189
##   koi_steff_err2 koi_slogg koi_slogg_err1 koi_slogg_err2 koi_srad
## 1            -81     4.467          0.064         -0.096    0.927
## 2            -81     4.467          0.064         -0.096    0.927
## 3           -176     4.544          0.044         -0.176    0.868
## 4           -174     4.564          0.053         -0.168    0.791
## 5           -211     4.438          0.070         -0.210    1.046
## 6           -232     4.486          0.054         -0.229    0.972
##   koi_srad_err1 koi_srad_err2       ra      dec koi_kepmag
## 1         0.105        -0.061 291.9342 48.14165     15.347
## 2         0.105        -0.061 291.9342 48.14165     15.347
## 3         0.233        -0.078 297.0048 48.13413     15.436
## 4         0.201        -0.067 285.5346 48.28521     15.597
## 5         0.334        -0.133 288.7549 48.22620     15.509
## 6         0.315        -0.105 296.2861 48.22467     15.714

Assigning NA to the blank values in the entire dataset

starData[starData ==""] <- NA

List the name and number of the columns that have at least one missing value.

naCol <-  which(colMeans(is.na(starData))>0) 
naCol
##       kepler_name         koi_score   koi_period_err1   koi_period_err2 
##                 4                 7                13                14 
##  koi_time0bk_err1  koi_time0bk_err2        koi_impact   koi_impact_err1 
##                16                17                18                19 
##   koi_impact_err2 koi_duration_err1 koi_duration_err2         koi_depth 
##                20                22                23                24 
##    koi_depth_err1    koi_depth_err2          koi_prad     koi_prad_err1 
##                25                26                27                28 
##     koi_prad_err2           koi_teq      koi_teq_err1      koi_teq_err2 
##                29                30                31                32 
##         koi_insol    koi_insol_err1    koi_insol_err2     koi_model_snr 
##                33                34                35                36 
##  koi_tce_plnt_num koi_tce_delivname         koi_steff    koi_steff_err1 
##                37                38                39                40 
##    koi_steff_err2         koi_slogg    koi_slogg_err1    koi_slogg_err2 
##                41                42                43                44 
##          koi_srad     koi_srad_err1     koi_srad_err2        koi_kepmag 
##                45                46                47                50
naVal<- vector()

for (colnum in 1:50) {
  
  naVal[colnum] <- sum(complete.cases(starData[colnum])==FALSE) 
  
}
naVal<-naVal[naVal!=0]
NaData <- data.frame(naCol,naVal)
barplot(NaData$naVal,main = "Missing Value Counts", names.arg = NaData$naCol, cex.names = 0.5,
    xlab="column names", col="red")
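
As a cross-check on the loop above, the per-column missing counts can be computed in one step. This is a minimal sketch on a made-up data frame (toy and its columns are hypothetical), not the code used to produce the plot:

```r
# Toy data frame with missing values in two of its three columns.
toy <- data.frame(
  a = c(1, NA, 3, 4),
  b = c("x", NA, "z", NA),
  c = c(10, 20, 30, 40),
  stringsAsFactors = FALSE
)

# is.na() returns a logical matrix; colSums() counts the TRUEs per column.
na_counts <- colSums(is.na(toy))

# Keep only the columns that actually have missing values.
na_counts <- na_counts[na_counts > 0]
print(na_counts)
```

Because the result keeps the column names, names(na_counts) could be passed as names.arg to barplot() to label the bars by name rather than by column number.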

titleLabels[4]
## [1] "kepler_name"
titleLabels[7]
## [1] "koi_score"
titleLabels[31]
## [1] "koi_teq_err1"
titleLabels[32]
## [1] "koi_teq_err2"

As shown above, the columns with the most missing values are, in order: 1. “koi_teq_err1” and “koi_teq_err2” 2. “kepler_name” 3. “koi_score”

Since missing values occur in many of the columns, it is unreasonable to delete every row that contains one. Disregarding the 4 columns with the most missing values, many of the remaining columns seem to be missing about 10% of their values. If we were to delete all rows with at least one missing value, we would not be removing 10% of the data; it would be close to 50%, since the missing values are not all located in the same rows.
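
The gap between per-column and per-row missingness can be illustrated with a small simulation. This is a minimal sketch on made-up data, not the Kepler table: it assumes five hypothetical columns that each lose 10% of their values independently, which may not match the real missingness pattern.

```r
set.seed(1)  # arbitrary seed so the toy example is repeatable
n <- 1000
toy <- data.frame(
  x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n), x5 = rnorm(n)
)

# Blank out 10% of each column, choosing the rows independently per column.
for (j in seq_along(toy)) {
  toy[sample(n, n / 10), j] <- NA
}

col_missing <- colMeans(is.na(toy))        # share missing in each column
row_missing <- mean(!complete.cases(toy))  # share of rows with any NA

print(col_missing)
print(row_missing)
```

Under this independence assumption the expected share of incomplete rows is 1 - 0.9^5, roughly 41%, even though no single column is missing more than 10% of its values.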

Instead of removing all the rows with missing values, we did two different things: we replaced missing numerical values with the column mean, and we replaced missing categorical values with the mode (the most frequent level).
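
The two imputation rules can be sketched as a single helper. This is an illustration on a made-up data frame; impute_mean_mode and mode_value are hypothetical names introduced here, and the helper assumes every non-numeric column should receive its most frequent value:

```r
# Most frequent non-missing value of a vector.
mode_value <- function(x) {
  x <- x[!is.na(x)]
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

# Fill numeric columns with the mean and all other columns with the mode.
impute_mean_mode <- function(df) {
  for (col in names(df)) {
    missing <- is.na(df[[col]])
    if (!any(missing)) next
    if (is.numeric(df[[col]])) {
      df[[col]][missing] <- mean(df[[col]], na.rm = TRUE)
    } else {
      df[[col]][missing] <- mode_value(df[[col]])
    }
  }
  df
}

toy <- data.frame(
  depth = c(10, NA, 30),
  deliv = c("a", "a", NA),
  stringsAsFactors = FALSE
)
filled <- impute_mean_mode(toy)
print(filled)
```

Mean imputation keeps every row but shrinks the variance of the imputed columns, which is worth keeping in mind when interpreting the models later on.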

starData<- starData[-c(4,7,31,32)]
levels(starData$koi_tce_delivname)
## [1] ""                "q1_q16_tce"      "q1_q17_dr24_tce" "q1_q17_dr25_tce"
dataA <- subset(starData,starData$koi_tce_delivname =="q1_q16_tce")
dataB <- subset(starData,starData$koi_tce_delivname =="q1_q17_dr24_tce")
dataC <- subset(starData,starData$koi_tce_delivname =="q1_q17_dr25_tce")
dim(dataA)[1]
## [1] 796
dim(dataB)[1]
## [1] 368
dim(dataC)[1]
## [1] 8054
starData$koi_tce_delivname[is.na(starData$koi_tce_delivname)] <- "q1_q17_dr25_tce"
starData$koi_tce_delivname <- factor(starData$koi_tce_delivname)
levels(starData$koi_tce_delivname)
## [1] "q1_q16_tce"      "q1_q17_dr24_tce" "q1_q17_dr25_tce"
starData$koi_tce_plnt_num <- as.factor(starData$koi_tce_plnt_num)

levels(starData$koi_tce_plnt_num)
## [1] "1" "2" "3" "4" "5" "6" "7" "8"
Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

Mode(starData$koi_tce_plnt_num)
## [1] 1
## Levels: 1 2 3 4 5 6 7 8
starData$koi_tce_plnt_num[is.na(starData$koi_tce_plnt_num)] <- "1"
# Replace missing values in the numeric columns with the column mean
for (i in c(6:32, 35:46)) {
  starData[is.na(starData[, i]), i] <- mean(starData[, i], na.rm = TRUE)
}
# Count the rows that still contain a missing value
sum(!complete.cases(starData))
## [1] 0
starData <- starData[-c(1:3,5)]
starData$koi_fpflag_co <- as.factor(starData$koi_fpflag_co)
starData$koi_fpflag_ec <- as.factor(starData$koi_fpflag_ec)
starData$koi_fpflag_nt <- as.factor(starData$koi_fpflag_nt)
starData$koi_fpflag_ss <- as.factor(starData$koi_fpflag_ss)
starData$koi_tce_delivname <- as.factor(starData$koi_tce_delivname)
starData$koi_tce_plnt_num <- as.factor(starData$koi_tce_plnt_num)

Rearrange the dataset so that the categorical variables come before the numerical variables

starData <- starData[c(1:5,29,30,6:28,31:42)]
head(starData)
##   koi_disposition koi_fpflag_nt koi_fpflag_ss koi_fpflag_co koi_fpflag_ec
## 1       CONFIRMED             0             0             0             0
## 2       CONFIRMED             0             0             0             0
## 3  FALSE POSITIVE             0             1             0             0
## 4  FALSE POSITIVE             0             1             0             0
## 5       CONFIRMED             0             0             0             0
## 6       CONFIRMED             0             0             0             0
##   koi_tce_plnt_num koi_tce_delivname koi_period koi_period_err1
## 1                1   q1_q17_dr25_tce   9.488036       2.775e-05
## 2                2   q1_q17_dr25_tce  54.418383       2.479e-04
## 3                1   q1_q17_dr25_tce  19.899140       1.494e-05
## 4                1   q1_q17_dr25_tce   1.736952       2.630e-07
## 5                1   q1_q17_dr25_tce   2.525592       3.761e-06
## 6                1   q1_q17_dr25_tce  11.094321       2.036e-05
##   koi_period_err2 koi_time0bk koi_time0bk_err1 koi_time0bk_err2 koi_impact
## 1      -2.775e-05    170.5387         0.002160        -0.002160      0.146
## 2      -2.479e-04    162.5138         0.003520        -0.003520      0.586
## 3      -1.494e-05    175.8503         0.000581        -0.000581      0.969
## 4      -2.630e-07    170.3076         0.000115        -0.000115      1.276
## 5      -3.761e-06    171.5956         0.001130        -0.001130      0.701
## 6      -2.036e-05    171.2012         0.001410        -0.001410      0.538
##   koi_impact_err1 koi_impact_err2 koi_duration koi_duration_err1
## 1           0.318          -0.146      2.95750           0.08190
## 2           0.059          -0.443      4.50700           0.11600
## 3           5.126          -0.077      1.78220           0.03410
## 4           0.115          -0.092      2.40641           0.00537
## 5           0.235          -0.478      1.65450           0.04200
## 6           0.030          -0.428      4.59450           0.06100
##   koi_duration_err2 koi_depth koi_depth_err1 koi_depth_err2 koi_prad
## 1          -0.08190     615.8           19.5          -19.5     2.26
## 2          -0.11600     874.8           35.5          -35.5     2.83
## 3          -0.03410   10829.0          171.0         -171.0    14.60
## 4          -0.00537    8079.2           12.8          -12.8    33.46
## 5          -0.04200     603.3           16.9          -16.9     2.75
## 6          -0.06100    1517.5           24.2          -24.2     3.90
##   koi_prad_err1 koi_prad_err2 koi_teq koi_insol koi_insol_err1
## 1          0.26         -0.15     793     93.59          29.45
## 2          0.32         -0.19     443      9.11           2.87
## 3          3.92         -1.31     638     39.30          31.04
## 4          8.50         -2.83    1395    891.96         668.95
## 5          0.88         -0.35    1406    926.16         874.33
## 6          1.27         -0.42     835    114.81         112.85
##   koi_insol_err2 koi_model_snr koi_steff koi_steff_err1 koi_steff_err2
## 1         -16.65          35.8      5455             81            -81
## 2          -1.62          25.8      5455             81            -81
## 3         -10.49          76.3      5853            158           -176
## 4        -230.35         505.6      5805            157           -174
## 5        -314.24          40.9      6031            169           -211
## 6         -36.70          66.5      6046            189           -232
##   koi_slogg koi_slogg_err1 koi_slogg_err2 koi_srad koi_srad_err1
## 1     4.467          0.064         -0.096    0.927         0.105
## 2     4.467          0.064         -0.096    0.927         0.105
## 3     4.544          0.044         -0.176    0.868         0.233
## 4     4.564          0.053         -0.168    0.791         0.201
## 5     4.438          0.070         -0.210    1.046         0.334
## 6     4.486          0.054         -0.229    0.972         0.315
##   koi_srad_err2       ra      dec koi_kepmag
## 1        -0.061 291.9342 48.14165     15.347
## 2        -0.061 291.9342 48.14165     15.347
## 3        -0.078 297.0048 48.13413     15.436
## 4        -0.067 285.5346 48.28521     15.597
## 5        -0.133 288.7549 48.22620     15.509
## 6        -0.105 296.2861 48.22467     15.714

Step 4 - 6: Data Modelling, Data Analysis, and Data Visualization

Create randomized training and testing set

num_samples = dim(starData)[1]
sampling.rate = 0.8
training <- sample(1:num_samples, sampling.rate * num_samples, replace=FALSE) 
trainingSet <- subset(starData[training, ])
testing <- setdiff(1:num_samples,training)
testingSet <- subset(starData[testing, ])
names(trainingSet)
##  [1] "koi_disposition"   "koi_fpflag_nt"     "koi_fpflag_ss"    
##  [4] "koi_fpflag_co"     "koi_fpflag_ec"     "koi_tce_plnt_num" 
##  [7] "koi_tce_delivname" "koi_period"        "koi_period_err1"  
## [10] "koi_period_err2"   "koi_time0bk"       "koi_time0bk_err1" 
## [13] "koi_time0bk_err2"  "koi_impact"        "koi_impact_err1"  
## [16] "koi_impact_err2"   "koi_duration"      "koi_duration_err1"
## [19] "koi_duration_err2" "koi_depth"         "koi_depth_err1"   
## [22] "koi_depth_err2"    "koi_prad"          "koi_prad_err1"    
## [25] "koi_prad_err2"     "koi_teq"           "koi_insol"        
## [28] "koi_insol_err1"    "koi_insol_err2"    "koi_model_snr"    
## [31] "koi_steff"         "koi_steff_err1"    "koi_steff_err2"   
## [34] "koi_slogg"         "koi_slogg_err1"    "koi_slogg_err2"   
## [37] "koi_srad"          "koi_srad_err1"     "koi_srad_err2"    
## [40] "ra"                "dec"               "koi_kepmag"
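
The split above is not seeded, so the rows that land in each set change from run to run. A minimal sketch of a repeatable split on a made-up data frame follows; the seed value and the toy data are hypothetical, and floor() only makes explicit the truncation that sample() would otherwise apply to a fractional size:

```r
set.seed(42)  # hypothetical seed; any fixed value makes the split repeatable

toy <- data.frame(id = 1:10, label = rep(c("A", "B"), 5))
n <- nrow(toy)

# Draw 80% of the row indices for training; the remainder is the test set.
train_idx <- sample(seq_len(n), size = floor(0.8 * n))
test_idx  <- setdiff(seq_len(n), train_idx)

train_set <- toy[train_idx, ]
test_set  <- toy[test_idx, ]

print(dim(train_set))
print(dim(test_set))
```

With a fixed seed the misclassification rates of different models can be compared on exactly the same test rows.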

Supervised Learning

1) Can we determine the classification system for exoplanet candidates (koi_disposition)?

koi_disposition has three categories, as mentioned above: “CANDIDATE”, “CONFIRMED”, and “FALSE POSITIVE”. Blank values are classified as “NOT DISPOSITIONED” and are ignored. These values come from historical dispositions of exoplanet candidates in the literature. KOI stands for Kepler Object of Interest, a signal that Kepler has flagged as a possible planet. The objective of the model is to predict whether a KOI is a candidate, confirmed, or a false positive.

For this problem we fitted the following models:

- Decision Tree
- Random Forest
- KNN
- SVM
- Neural Network

Linear regression was not used since there were many categorical variables. Logistic regression was not used since koi_disposition has three categories, which makes it more complex to analyze.

Decision Tree

decTreeModel <- rpart(koi_disposition ~ .,data=trainingSet,method = "class")
prp(decTreeModel)

plotcp(decTreeModel)

pruned_decTreeModel = prune(decTreeModel, cp=0.012)
prp(pruned_decTreeModel)

As shown above, the most important factors in determining the classification of a KOI are ranked as follows:

  1. koi_fpflag_s
  2. koi_fpflag_n
  3. koi_fpflag_C
  4. koi_model_sn
  5. koi_fpflag_e
  6. koi_prad_err

The decision tree makes it easy to understand and visualize the important aspects of this problem. It was also fast to implement, as it can handle both categorical and numerical data.

predictedLabels<-predict(pruned_decTreeModel, testingSet, type = "class")
sizeTestSet = dim(testingSet)[1]
error = sum(predictedLabels != testingSet$koi_disposition)
misclassification_rate = error/sizeTestSet
print(misclassification_rate)
## [1] 0.1306848
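
A single misclassification rate hides which classes are being confused. As a minimal sketch, a confusion matrix built with table() breaks the error down by class; the label vectors below are made up for illustration and are not predictions from the model above:

```r
lv <- c("CANDIDATE", "CONFIRMED", "FALSE POSITIVE")

# Hypothetical true and predicted labels for six observations.
actual <- factor(
  c("CANDIDATE", "CONFIRMED", "FALSE POSITIVE", "CONFIRMED", "CANDIDATE", "FALSE POSITIVE"),
  levels = lv
)
predicted <- factor(
  c("CANDIDATE", "CANDIDATE", "FALSE POSITIVE", "CONFIRMED", "CONFIRMED", "FALSE POSITIVE"),
  levels = lv
)

# Rows are the true classes, columns the predicted classes.
conf_mat <- table(actual = actual, predicted = predicted)

# Off-diagonal cells are errors, so the rate is 1 minus the diagonal share.
misclassification_rate <- 1 - sum(diag(conf_mat)) / sum(conf_mat)

# Share of each true class that was predicted correctly.
per_class_recall <- diag(conf_mat) / rowSums(conf_mat)

print(conf_mat)
print(misclassification_rate)
print(per_class_recall)
```

Applied to the real predictions, the same table would show, for example, whether most errors come from mixing up CANDIDATE and CONFIRMED.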

Random Forest

RandForestModel <- randomForest(koi_disposition ~ .,data=trainingSet)
plot(RandForestModel)
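# err.rate has one column for the OOB error plus one per class, so use one colour per column
legend("top", colnames(RandForestModel$err.rate), fill = 1:ncol(RandForestModel$err.rate))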

predictedLabels<-predict(RandForestModel, testingSet)
sizeTestSet = dim(testingSet)[1]

error = sum(predictedLabels != testingSet$koi_disposition)

misclassification_rate = error/sizeTestSet

print(misclassification_rate)
## [1] 0.1024569

The random forest model had a lower misclassification rate than the decision tree (0.102 versus 0.131). As we learned in class, decision trees are prone to overfitting. Random forest models mitigate overfitting and can lead to more accurate classification and prediction, as seen in this case.

KNN Model

Normalize all data

starData[8:42] <- scale(starData[8:42])
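
The line above scales the whole dataset before it is split, so the test rows contribute to the means and standard deviations. A common alternative, sketched here on simulated data with hypothetical column names, is to estimate the scaling parameters on the training rows only and reuse them for the test rows:

```r
set.seed(7)  # arbitrary seed for the toy example
x <- data.frame(a = rnorm(20, mean = 100, sd = 15), b = runif(20))

train_idx <- sample(20, 16)
train_x <- x[train_idx, ]
test_x  <- x[-train_idx, ]

# scale() stores the centring and scaling values it used as attributes.
train_scaled <- scale(train_x)
centers <- attr(train_scaled, "scaled:center")
scales  <- attr(train_scaled, "scaled:scale")

# Apply the training-set parameters to the test rows.
test_scaled <- scale(test_x, center = centers, scale = scales)

print(round(colMeans(train_scaled), 10))
print(dim(test_scaled))
```

With several thousand rows the difference from scaling everything at once is usually small, but this ordering keeps the test set fully unseen.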

Relabel the levels of koi_tce_delivname as numbers

levels(starData$koi_tce_delivname)
## [1] "q1_q16_tce"      "q1_q17_dr24_tce" "q1_q17_dr25_tce"
# Relabel the three levels in place; they keep the order printed above
levels(starData$koi_tce_delivname) <- c("1", "2", "3")
levels(starData$koi_tce_delivname)
## [1] "1" "2" "3"
num_samples = dim(starData)[1]
sampling.rate = 0.8
training <- sample(1:num_samples, sampling.rate * num_samples, replace=FALSE)
trainingSet <- starData[training, ]
testing <- setdiff(1:num_samples,training)
testingSet <- starData[testing, ]
trainingfeatures <- subset(trainingSet, select=c(-koi_disposition))

traininglabels <- trainingSet$koi_disposition

testingfeatures <- subset(testingSet, select=c(-koi_disposition))
currentBestError = Inf
currentBestVar = -1
for(i in 1:30) { 
  predictedLabels = knn(trainingfeatures,testingfeatures,traininglabels,k=i)
  error = sum(predictedLabels != testingSet$koi_disposition)
  if(error < currentBestError){
    print(paste0("We found a better k: ",i))
    currentBestError = error 
    currentBestVar = i
  }
}
## [1] "We found a better k: 1"
## [1] "We found a better k: 3"
## [1] "We found a better k: 5"
## [1] "We found a better k: 6"
## [1] "We found a better k: 19"
## [1] "We found a better k: 23"
currentBestVar
## [1] 23
currentBestError / (dim(testingSet)[1])
## [1] 0.2263461
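
The loop above picks k by its error on the test set, so the reported error is somewhat optimistic for that k. One common alternative, sketched here under stated assumptions, is to hold out a separate validation set for choosing k and to use the test set only once. The data are simulated, and knn_predict is a hypothetical base-R helper written for this example rather than the knn() function used above.

```r
# Minimal k-nearest-neighbour classifier with Euclidean distance (illustrative helper).
knn_predict <- function(train_x, train_y, test_x, k) {
  train_x <- as.matrix(train_x)
  test_x  <- as.matrix(test_x)
  apply(test_x, 1, function(row) {
    d <- sqrt(colSums((t(train_x) - row)^2))  # distance to every training row
    nearest <- order(d)[seq_len(k)]           # indices of the k closest rows
    votes <- table(train_y[nearest])
    names(votes)[which.max(votes)]            # majority vote
  })
}

set.seed(123)  # arbitrary seed for the toy example
n <- 150
# Two simulated classes in two dimensions, centred at 0 and at 3.
x <- rbind(matrix(rnorm(n, mean = 0), ncol = 2),
           matrix(rnorm(n, mean = 3), ncol = 2))
y <- rep(c("A", "B"), each = n / 2)

# Shuffle the rows, then split 90 / 30 / 30 into train, validation and test.
idx   <- sample(nrow(x))
train <- idx[1:90]
valid <- idx[91:120]
test  <- idx[121:150]

# Choose k using the validation set only.
candidate_k <- c(1, 3, 5, 7, 9)
valid_err <- sapply(candidate_k, function(k) {
  mean(knn_predict(x[train, ], y[train], x[valid, ], k) != y[valid])
})
best_k <- candidate_k[which.min(valid_err)]

# Report the error once, on rows never used for tuning.
test_err <- mean(knn_predict(x[train, ], y[train], x[test, ], best_k) != y[test])
print(best_k)
print(test_err)
```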

KNN cross fold validation

AllErrors=c()
for(fold in 1:50)
{
  # Get training and testing sets
  num_samples = dim(starData)[1]
  sampling.rate = 0.8
  training <- sample(1:num_samples, sampling.rate * num_samples, replace=FALSE)
  trainingSet <- starData[training, ]
  testing <- setdiff(1:num_samples,training)
  testingSet <- starData[testing, ]
  
  trainingfeatures <- subset(trainingSet, select=c(-koi_disposition))

  traininglabels <- trainingSet$koi_disposition

  testingfeatures <- subset(testingSet, select=c(-koi_disposition))
 
  predictedLabels = knn(trainingfeatures,testingfeatures,traininglabels,k=currentBestVar)
    
  
  error = sum(predictedLabels != testingSet$koi_disposition)
  errorRate <- error / (dim(testingSet)[1])
  AllErrors[fold] = errorRate
}
AverageError = mean(AllErrors)
AverageError
## [1] 0.2295975

By repeating the random 80/20 split 50 times and averaging the error rates, we obtain a more reliable estimate of the error than a single split gives.
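
The loop above repeats random 80/20 splits, so a given row may be tested many times or not at all. In k-fold cross-validation each row is instead held out exactly once. The following is a minimal sketch of that scheme on simulated data; kfold_error and majority_class are hypothetical helpers, and the majority-class rule is only a placeholder for a real classifier such as KNN:

```r
# Average held-out error over n_folds folds for any fit-and-predict function.
kfold_error <- function(x, y, n_folds, fit_predict) {
  # Assign every row to one fold, in random order.
  fold_id <- sample(rep(seq_len(n_folds), length.out = nrow(x)))
  errs <- numeric(n_folds)
  for (f in seq_len(n_folds)) {
    held_out <- fold_id == f
    pred <- fit_predict(x[!held_out, , drop = FALSE], y[!held_out],
                        x[held_out, , drop = FALSE])
    errs[f] <- mean(pred != y[held_out])
  }
  mean(errs)
}

# Placeholder classifier: always predicts the most common training label.
majority_class <- function(train_x, train_y, test_x) {
  rep(names(which.max(table(train_y))), nrow(test_x))
}

set.seed(1)  # arbitrary seed for the toy example
x <- matrix(rnorm(200), ncol = 2)      # 100 rows, 2 features
y <- rep(c("A", "A", "A", "B"), 25)    # 75% of the labels are "A"

cv_err <- kfold_error(x, y, 5, majority_class)
print(cv_err)
```

Any function with the same three arguments can be passed as fit_predict, which makes it easy to compare classifiers on identical folds.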

SVM

starData$koi_disposition <- as.factor(starData$koi_disposition)
levels(starData$koi_disposition)
## [1] "CANDIDATE"      "CONFIRMED"      "FALSE POSITIVE"

SVM Linear

This model takes a long time to fit, so we call print(i) inside the loop to track progress. Ideally we would search a wider range of cost values, but to keep the run time manageable we test only 15 to 20 to demonstrate the concept. Feel free to change the range of the for-loop during assessment.

currentBestError = Inf
currentBestVar = -1
for(i in 15:20) {
  svmModel <- svm(koi_disposition~., data=trainingSet, kernel="linear", cost=i)
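  # Predict on the test set with the model just fitted, so the error reflects this SVM
  predictedLabels <- predict(svmModel, testingSet)
  error = sum(predictedLabels != testingSet$koi_disposition)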
  print(i)
  if(error < currentBestError){
    print(paste0("We found a better cost: ",i))
    currentBestError = error 
    currentBestVar = i
  }
}
## [1] 15
## [1] "We found a better cost: 15"
## [1] 16
## [1] 17
## [1] 18
## [1] 19
## [1] 20
currentBestVar
## [1] 15
currentBestError / (dim(testingSet)[1])
## [1] 0.2368008

SVM Polynomial

currentBestError = Inf
currentBestVar = -1
for(i in 15:20) {
  svmModel <- svm(koi_disposition~., data=trainingSet, kernel="polynomial", cost=i)
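  # Predict on the test set with the model just fitted, so the error reflects this SVM
  predictedLabels <- predict(svmModel, testingSet)
  error = sum(predictedLabels != testingSet$koi_disposition)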
  print(i)
  if(error < currentBestError){
    print(paste0("We found a better cost: ",i))
    currentBestError = error 
    currentBestVar = i
  }
}
## [1] 15
## [1] "We found a better cost: 15"
## [1] 16
## [1] 17
## [1] 18
## [1] 19
## [1] 20
currentBestVar
## [1] 15
currentBestError / (dim(testingSet)[1])
## [1] 0.2368008

SVM Radial

currentBestError = Inf
currentBestVar = -1
for(i in 15:20) {
  svmModel <- svm(koi_disposition~., data=trainingSet, kernel="radial", cost=i)
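  # Predict on the test set with the model just fitted, so the error reflects this SVM
  predictedLabels <- predict(svmModel, testingSet)
  error = sum(predictedLabels != testingSet$koi_disposition)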
  print(i)
  if(error < currentBestError){
    print(paste0("We found a better cost: ",i))
    currentBestError = error 
    currentBestVar = i
  }
}
## [1] 15
## [1] "We found a better cost: 15"
## [1] 16
## [1] 17
## [1] 18
## [1] 19
## [1] 20
currentBestVar
## [1] 15
currentBestError / (dim(testingSet)[1])
## [1] 0.2368008

Neural Network

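A neural network is a more complex machine learning algorithm. Its inputs must be numeric, so we first convert the class label and the categorical features to numbers.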
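
The conversions in this section apply as.numeric() to factors, which returns the internal level codes rather than the original values; this is why the 0/1 flags appear as 1/2 in the str() output of this section. A minimal sketch of the difference, and of one-hot encoding as an alternative for a multi-level factor, follows. The vectors are made up for illustration:

```r
# A 0/1 flag stored as a factor.
flag <- factor(c("0", "1", "0", "1"))

codes  <- as.numeric(flag)                # level codes: 1 2 1 2
values <- as.numeric(as.character(flag))  # original values: 0 1 0 1

# A three-level factor turned into one indicator column per level.
deliv <- factor(c("q1_q16_tce", "q1_q17_dr25_tce", "q1_q17_dr24_tce"))
one_hot <- model.matrix(~ deliv - 1)

print(codes)
print(values)
print(one_hot)
```

For a two-level flag the shift from 0/1 to 1/2 is harmless once the column is rescaled. For a factor with more levels, integer codes impose an ordering on the levels that one-hot columns avoid.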

head(starData)
##   koi_disposition koi_fpflag_nt koi_fpflag_ss koi_fpflag_co koi_fpflag_ec
## 1       CONFIRMED             0             0             0             0
## 2       CONFIRMED             0             0             0             0
## 3  FALSE POSITIVE             0             1             0             0
## 4  FALSE POSITIVE             0             1             0             0
## 5       CONFIRMED             0             0             0             0
## 6       CONFIRMED             0             0             0             0
##   koi_tce_plnt_num koi_tce_delivname  koi_period koi_period_err1
## 1                1                 3 -0.04958503      -0.2637455
## 2                2                 3 -0.01592288      -0.2363585
## 3                1                 3 -0.04178495      -0.2653391
## 4                1                 3 -0.05539220      -0.2671649
## 5                1                 3 -0.05480134      -0.2667298
## 6                1                 3 -0.04838159      -0.2646648
##   koi_period_err2 koi_time0bk koi_time0bk_err1 koi_time0bk_err2
## 1       0.2637455  0.06412788       -0.3447962        0.3447962
## 2       0.2363585 -0.05402631       -0.2844657        0.2844657
## 3       0.2653391  0.14233141       -0.4148417        0.4148417
## 4       0.2671649  0.06072405       -0.4355138        0.4355138
## 5       0.2667298  0.07968760       -0.3904877        0.3904877
## 6       0.2646648  0.07388083       -0.3780667        0.3780667
##    koi_impact koi_impact_err1 koi_impact_err2 koi_duration
## 1 -0.17935061      -0.1785545      0.15294104   -0.4116641
## 2 -0.04539452      -0.2067211     -0.09054152   -0.1722317
## 3  0.07120817       0.3443219      0.20950769   -0.5932743
## 4  0.16467299      -0.2006311      0.19721059   -0.4968199
## 5 -0.01038326      -0.1875809     -0.11923475   -0.6130068
## 6 -0.06000791      -0.2098749     -0.07824442   -0.1587109
##   koi_duration_err1 koi_duration_err2  koi_depth koi_depth_err1
## 1        -0.3947219         0.3947219 -0.2873000    -0.02583522
## 2        -0.3425597         0.3425597 -0.2840893    -0.02184898
## 3        -0.4678408         0.4678408 -0.1606901     0.01190950
## 4        -0.5117886         0.5117886 -0.1947785    -0.02750446
## 5        -0.4557563         0.4557563 -0.2874550    -0.02648299
## 6        -0.4266923         0.4266923 -0.2761219    -0.02466426
##   koi_depth_err2    koi_prad koi_prad_err1 koi_prad_err2    koi_teq
## 1     0.02583522 -0.03333655   -0.04534862    0.02808129 -0.3481029
## 2     0.02184898 -0.03314772   -0.04519222    0.02804712 -0.7647988
## 3    -0.01190950 -0.02924864   -0.03580850    0.02709038 -0.5326397
## 4     0.02750446 -0.02300084   -0.02387032    0.02579196  0.3686142
## 5     0.02648299 -0.03317422   -0.04373253    0.02791044  0.3817104
## 6     0.02466426 -0.03279326   -0.04271596    0.02785064 -0.2980993
##     koi_insol koi_insol_err1 koi_insol_err2 koi_model_snr  koi_steff
## 1 -0.04889243    -0.06876874     0.04634332    -0.2870964 -0.3221945
## 2 -0.04943220    -0.06925994     0.04651629    -0.2999078 -0.3221945
## 3 -0.04923931    -0.06873936     0.04641421    -0.2352104  0.1870253
## 4 -0.04379134    -0.05695077     0.04388395     0.3147818  0.1256119
## 5 -0.04357283    -0.05315534     0.04291850    -0.2805626  0.4147669
## 6 -0.04875685    -0.06722751     0.04611257    -0.2477655  0.4339586
##   koi_steff_err1 koi_steff_err2 koi_slogg koi_slogg_err1 koi_slogg_err2
## 1     -1.3868026      1.1464281 0.3696378     -0.4379769      0.5657535
## 2     -1.3868026      1.1464281 0.3696378     -0.4379769      0.5657535
## 3      0.2912499     -0.1937625 0.5511062     -0.5923628     -0.3939537
## 4      0.2694570     -0.1655480 0.5982408     -0.5228891     -0.2979830
## 5      0.5309717     -0.6875169 0.3012926     -0.3916611     -0.8018293
## 6      0.9668295     -0.9837696 0.4144157     -0.5151698     -1.0297598
##     koi_srad koi_srad_err1 koi_srad_err2          ra      dec koi_kepmag
## 1 -0.1334014   -0.28342173     0.1578657 -0.02641956 1.202701  0.7813000
## 2 -0.1334014   -0.28342173     0.1578657 -0.02641956 1.202701  0.7813000
## 3 -0.1432187   -0.14242259     0.1498259  1.03734270 1.200612  0.8455425
## 4 -0.1560312   -0.17767238     0.1550281 -1.36899986 1.242565  0.9617565
## 5 -0.1136003   -0.03116546     0.1238150 -0.69341739 1.226179  0.8982358
## 6 -0.1259136   -0.05209502     0.1370569  0.88656827 1.225754  1.0462101
levels(starData$koi_disposition)
## [1] "CANDIDATE"      "CONFIRMED"      "FALSE POSITIVE"
# Relabel the classes as "1", "2", "3", in the level order printed above
levels(starData$koi_disposition) <- c("1", "2", "3")
starData$koi_disposition <- as.numeric(starData$koi_disposition)
str(starData)
## 'data.frame':    9564 obs. of  42 variables:
##  $ koi_disposition  : num  2 2 3 3 2 2 2 2 3 2 ...
##  $ koi_fpflag_nt    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ koi_fpflag_ss    : Factor w/ 2 levels "0","1": 1 1 2 2 1 1 1 1 2 1 ...
##  $ koi_fpflag_co    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 2 1 ...
##  $ koi_fpflag_ec    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ koi_tce_plnt_num : Factor w/ 8 levels "1","2","3","4",..: 1 2 1 1 1 1 2 3 1 1 ...
##  $ koi_tce_delivname: Factor w/ 3 levels "1","2","3": 3 3 3 3 3 3 3 3 3 3 ...
##  $ koi_period       : num  -0.0496 -0.0159 -0.0418 -0.0554 -0.0548 ...
##  $ koi_period_err1  : num  -0.264 -0.236 -0.265 -0.267 -0.267 ...
##  $ koi_period_err2  : num  0.264 0.236 0.265 0.267 0.267 ...
##  $ koi_time0bk      : num  0.0641 -0.054 0.1423 0.0607 0.0797 ...
##  $ koi_time0bk_err1 : num  -0.345 -0.284 -0.415 -0.436 -0.39 ...
##  $ koi_time0bk_err2 : num  0.345 0.284 0.415 0.436 0.39 ...
##  $ koi_impact       : num  -0.1794 -0.0454 0.0712 0.1647 -0.0104 ...
##  $ koi_impact_err1  : num  -0.179 -0.207 0.344 -0.201 -0.188 ...
##  $ koi_impact_err2  : num  0.1529 -0.0905 0.2095 0.1972 -0.1192 ...
##  $ koi_duration     : num  -0.412 -0.172 -0.593 -0.497 -0.613 ...
##  $ koi_duration_err1: num  -0.395 -0.343 -0.468 -0.512 -0.456 ...
##  $ koi_duration_err2: num  0.395 0.343 0.468 0.512 0.456 ...
##  $ koi_depth        : num  -0.287 -0.284 -0.161 -0.195 -0.287 ...
##  $ koi_depth_err1   : num  -0.0258 -0.0218 0.0119 -0.0275 -0.0265 ...
##  $ koi_depth_err2   : num  0.0258 0.0218 -0.0119 0.0275 0.0265 ...
##  $ koi_prad         : num  -0.0333 -0.0331 -0.0292 -0.023 -0.0332 ...
##  $ koi_prad_err1    : num  -0.0453 -0.0452 -0.0358 -0.0239 -0.0437 ...
##  $ koi_prad_err2    : num  0.0281 0.028 0.0271 0.0258 0.0279 ...
##  $ koi_teq          : num  -0.348 -0.765 -0.533 0.369 0.382 ...
##  $ koi_insol        : num  -0.0489 -0.0494 -0.0492 -0.0438 -0.0436 ...
##  $ koi_insol_err1   : num  -0.0688 -0.0693 -0.0687 -0.057 -0.0532 ...
##  $ koi_insol_err2   : num  0.0463 0.0465 0.0464 0.0439 0.0429 ...
##  $ koi_model_snr    : num  -0.287 -0.3 -0.235 0.315 -0.281 ...
##  $ koi_steff        : num  -0.322 -0.322 0.187 0.126 0.415 ...
##  $ koi_steff_err1   : num  -1.387 -1.387 0.291 0.269 0.531 ...
##  $ koi_steff_err2   : num  1.146 1.146 -0.194 -0.166 -0.688 ...
##  $ koi_slogg        : num  0.37 0.37 0.551 0.598 0.301 ...
##  $ koi_slogg_err1   : num  -0.438 -0.438 -0.592 -0.523 -0.392 ...
##  $ koi_slogg_err2   : num  0.566 0.566 -0.394 -0.298 -0.802 ...
##  $ koi_srad         : num  -0.133 -0.133 -0.143 -0.156 -0.114 ...
##  $ koi_srad_err1    : num  -0.2834 -0.2834 -0.1424 -0.1777 -0.0312 ...
##  $ koi_srad_err2    : num  0.158 0.158 0.15 0.155 0.124 ...
##  $ ra               : num  -0.0264 -0.0264 1.0373 -1.369 -0.6934 ...
##  $ dec              : num  1.2 1.2 1.2 1.24 1.23 ...
##  $ koi_kepmag       : num  0.781 0.781 0.846 0.962 0.898 ...
starData$koi_fpflag_nt <- as.numeric(starData$koi_fpflag_nt)
starData$koi_fpflag_co <- as.numeric(starData$koi_fpflag_co)
starData$koi_fpflag_ec <- as.numeric(starData$koi_fpflag_ec)
starData$koi_fpflag_ss <- as.numeric(starData$koi_fpflag_ss)
starData$koi_tce_delivname <- as.numeric(starData$koi_tce_delivname)
starData$koi_tce_plnt_num <- as.numeric(starData$koi_tce_plnt_num)
str(starData)
## 'data.frame':    9564 obs. of  42 variables:
##  $ koi_disposition  : num  2 2 3 3 2 2 2 2 3 2 ...
##  $ koi_fpflag_nt    : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ koi_fpflag_ss    : num  1 1 2 2 1 1 1 1 2 1 ...
##  $ koi_fpflag_co    : num  1 1 1 1 1 1 1 1 2 1 ...
##  $ koi_fpflag_ec    : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ koi_tce_plnt_num : num  1 2 1 1 1 1 2 3 1 1 ...
##  $ koi_tce_delivname: num  3 3 3 3 3 3 3 3 3 3 ...
##  $ koi_period       : num  -0.0496 -0.0159 -0.0418 -0.0554 -0.0548 ...
##  $ koi_period_err1  : num  -0.264 -0.236 -0.265 -0.267 -0.267 ...
##  $ koi_period_err2  : num  0.264 0.236 0.265 0.267 0.267 ...
##  $ koi_time0bk      : num  0.0641 -0.054 0.1423 0.0607 0.0797 ...
##  $ koi_time0bk_err1 : num  -0.345 -0.284 -0.415 -0.436 -0.39 ...
##  $ koi_time0bk_err2 : num  0.345 0.284 0.415 0.436 0.39 ...
##  $ koi_impact       : num  -0.1794 -0.0454 0.0712 0.1647 -0.0104 ...
##  $ koi_impact_err1  : num  -0.179 -0.207 0.344 -0.201 -0.188 ...
##  $ koi_impact_err2  : num  0.1529 -0.0905 0.2095 0.1972 -0.1192 ...
##  $ koi_duration     : num  -0.412 -0.172 -0.593 -0.497 -0.613 ...
##  $ koi_duration_err1: num  -0.395 -0.343 -0.468 -0.512 -0.456 ...
##  $ koi_duration_err2: num  0.395 0.343 0.468 0.512 0.456 ...
##  $ koi_depth        : num  -0.287 -0.284 -0.161 -0.195 -0.287 ...
##  $ koi_depth_err1   : num  -0.0258 -0.0218 0.0119 -0.0275 -0.0265 ...
##  $ koi_depth_err2   : num  0.0258 0.0218 -0.0119 0.0275 0.0265 ...
##  $ koi_prad         : num  -0.0333 -0.0331 -0.0292 -0.023 -0.0332 ...
##  $ koi_prad_err1    : num  -0.0453 -0.0452 -0.0358 -0.0239 -0.0437 ...
##  $ koi_prad_err2    : num  0.0281 0.028 0.0271 0.0258 0.0279 ...
##  $ koi_teq          : num  -0.348 -0.765 -0.533 0.369 0.382 ...
##  $ koi_insol        : num  -0.0489 -0.0494 -0.0492 -0.0438 -0.0436 ...
##  $ koi_insol_err1   : num  -0.0688 -0.0693 -0.0687 -0.057 -0.0532 ...
##  $ koi_insol_err2   : num  0.0463 0.0465 0.0464 0.0439 0.0429 ...
##  $ koi_model_snr    : num  -0.287 -0.3 -0.235 0.315 -0.281 ...
##  $ koi_steff        : num  -0.322 -0.322 0.187 0.126 0.415 ...
##  $ koi_steff_err1   : num  -1.387 -1.387 0.291 0.269 0.531 ...
##  $ koi_steff_err2   : num  1.146 1.146 -0.194 -0.166 -0.688 ...
##  $ koi_slogg        : num  0.37 0.37 0.551 0.598 0.301 ...
##  $ koi_slogg_err1   : num  -0.438 -0.438 -0.592 -0.523 -0.392 ...
##  $ koi_slogg_err2   : num  0.566 0.566 -0.394 -0.298 -0.802 ...
##  $ koi_srad         : num  -0.133 -0.133 -0.143 -0.156 -0.114 ...
##  $ koi_srad_err1    : num  -0.2834 -0.2834 -0.1424 -0.1777 -0.0312 ...
##  $ koi_srad_err2    : num  0.158 0.158 0.15 0.155 0.124 ...
##  $ ra               : num  -0.0264 -0.0264 1.0373 -1.369 -0.6934 ...
##  $ dec              : num  1.2 1.2 1.2 1.24 1.23 ...
##  $ koi_kepmag       : num  0.781 0.781 0.846 0.962 0.898 ...
# Split the data into 80% training and 20% testing rows.
num_samples <- nrow(starData)
sampling.rate <- 0.8
training <- sample(1:num_samples, sampling.rate * num_samples, replace = FALSE)
trainingSet <- starData[training, ]
testing <- setdiff(1:num_samples, training)
testingSet <- starData[testing, ]
n <- names(starData)
f <- as.formula(paste("koi_disposition ~", paste(n[!n %in% "koi_disposition"], collapse = " + ")))
f
## koi_disposition ~ koi_fpflag_nt + koi_fpflag_ss + koi_fpflag_co + 
##     koi_fpflag_ec + koi_tce_plnt_num + koi_tce_delivname + koi_period + 
##     koi_period_err1 + koi_period_err2 + koi_time0bk + koi_time0bk_err1 + 
##     koi_time0bk_err2 + koi_impact + koi_impact_err1 + koi_impact_err2 + 
##     koi_duration + koi_duration_err1 + koi_duration_err2 + koi_depth + 
##     koi_depth_err1 + koi_depth_err2 + koi_prad + koi_prad_err1 + 
##     koi_prad_err2 + koi_teq + koi_insol + koi_insol_err1 + koi_insol_err2 + 
##     koi_model_snr + koi_steff + koi_steff_err1 + koi_steff_err2 + 
##     koi_slogg + koi_slogg_err1 + koi_slogg_err2 + koi_srad + 
##     koi_srad_err1 + koi_srad_err2 + ra + dec + koi_kepmag
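# Fit a network with three hidden layers (7, 5, and 3 neurons).
nnModel <- neuralnet(f, data = trainingSet, hidden = c(7, 5, 3), linear.output = FALSE)
plot(nnModel)
# Predict on the test features (columns 2 to 42) and round to the nearest label.
predictedLabels <- compute(nnModel, testingSet[, 2:42])
predictedLabels <- round(predictedLabels$net.result)
# Misclassification rate: share of test rows whose rounded prediction differs from the label.
sizeTestSet <- dim(testingSet)[1]
error <- sum(predictedLabels != testingSet$koi_disposition)
misclassification_rate <- error / sizeTestSet
print(misclassification_rate)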
## [1] 0.7600627287
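
One possible contributor to this high rate, offered as an assumption rather than a confirmed diagnosis: with linear.output=FALSE, neuralnet applies the activation function (logistic by default) to the output neuron, so predictions lie between 0 and 1 and can only round to 0 or 1, while koi_disposition is coded 1, 2, 3. The base-R sketch below uses made-up numbers (raw_out, labels, and scores are hypothetical, not taken from the Kepler data) to illustrate the effect, and then shows one common alternative: one 0/1 target column per class, decoded by picking the highest-scoring column.

```r
# Hypothetical raw (pre-activation) outputs of a single output neuron.
raw_out <- c(-2, 0.3, 1.5, 4, 10)
# The logistic activation squashes every value into (0, 1).
squashed <- plogis(raw_out)
rounded <- round(squashed)
print(rounded)  # only 0 or 1 can occur, never 2 or 3

# Class labels coded 1..3, as in starData$koi_disposition.
labels <- c(2, 2, 3, 1, 3)
# One-hot encoding: one 0/1 column per class.
one_hot <- diag(3)[labels, ]
colnames(one_hot) <- c("class1", "class2", "class3")

# Hypothetical per-class scores from a three-output network.
scores <- rbind(c(0.1, 0.8, 0.1),
                c(0.2, 0.7, 0.1),
                c(0.1, 0.2, 0.7),
                c(0.6, 0.3, 0.1),
                c(0.3, 0.3, 0.4))
# Decode by taking the column with the highest score in each row.
predicted <- max.col(scores, ties.method = "first")
misclassification_rate <- mean(predicted != labels)
print(misclassification_rate)
```

Under this encoding the formula would have three response columns instead of one; whether that alone would bring the rate down on the Kepler data is untested here.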

After fitting all of the models, the random forest was the most accurate, with the lowest misclassification rate of approximately 11%.

Unsupervised Learning

k-Means Clustering: Planet Categorization

One of our larger investigations was an unsupervised learning problem: given the planet data available, could we use unsupervised learning methods to come up with a planet categorization scheme? And, assuming we had a viable scheme, would it match any of those that astronomers have already devised?

The tool we chose was k-means clustering. We first needed to decide which planets to cluster, and settled on observations with a koi_score of at least 0.8 because we wanted to be reasonably certain that the observations we included were planets. This still left us with a decent number of data points to work with.

# Verify there is an adequate volume of data after proposed koi_score filtering.
num_pts <- sum(kepler_df$koi_score >= 0.8 & !is.na(kepler_df$koi_score))
ggplot(kepler_df, aes(x = koi_score, fill = koi_pdisposition)) + geom_histogram() + geom_vline(xintercept=0.8, colour="orange", linetype = "longdash") + annotate("text", x = 0.7, y = 2000, label = "koi_score\ncutoff")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 1510 rows containing non-finite values (stat_bin).

print(paste("The number of data points for koi_score >= 0.8:", num_pts))
## [1] "The number of data points for koi_score >= 0.8: 3682"
planets_df <- subset(kepler_df, koi_score >= 0.8)
head(planets_df)
##   kepoi_name  kepler_name koi_disposition koi_pdisposition koi_score
## 1  K00752.01 Kepler-227 b       CONFIRMED        CANDIDATE     1.000
## 2  K00752.02 Kepler-227 c       CONFIRMED        CANDIDATE     0.969
## 5  K00755.01 Kepler-664 b       CONFIRMED        CANDIDATE     1.000
## 6  K00756.01 Kepler-228 d       CONFIRMED        CANDIDATE     1.000
## 7  K00756.02 Kepler-228 c       CONFIRMED        CANDIDATE     1.000
## 8  K00756.03 Kepler-228 b       CONFIRMED        CANDIDATE     0.992
##   koi_fpflag_nt           koi_fpflag_ss koi_fpflag_co koi_fpflag_ec
## 1             0 No binary star detected             0             0
## 2             0 No binary star detected             0             0
## 5             0 No binary star detected             0             0
## 6             0 No binary star detected             0             0
## 7             0 No binary star detected             0             0
## 8             0 No binary star detected             0             0
##     koi_period koi_period_err1 koi_period_err2 koi_time0bk
## 1  9.488035570     0.000027750    -0.000027750   170.53875
## 2 54.418382700     0.000247900    -0.000247900   162.51384
## 5  2.525591777     0.000003761    -0.000003761   171.59555
## 6 11.094320540     0.000020360    -0.000020360   171.20116
## 7  4.134435120     0.000010460    -0.000010460   172.97937
## 8  2.566588970     0.000017810    -0.000017810   179.55437
##   koi_time0bk_err1 koi_time0bk_err2 koi_impact koi_impact_err1
## 1          0.00216         -0.00216      0.146           0.318
## 2          0.00352         -0.00352      0.586           0.059
## 5          0.00113         -0.00113      0.701           0.235
## 6          0.00141         -0.00141      0.538           0.030
## 7          0.00190         -0.00190      0.762           0.139
## 8          0.00461         -0.00461      0.755           0.212
##   koi_impact_err2 koi_duration koi_duration_err1 koi_duration_err2
## 1          -0.146       2.9575            0.0819           -0.0819
## 2          -0.443       4.5070            0.1160           -0.1160
## 5          -0.478       1.6545            0.0420           -0.0420
## 6          -0.428       4.5945            0.0610           -0.0610
## 7          -0.532       3.1402            0.0673           -0.0673
## 8          -0.523       2.4290            0.1650           -0.1650
##   koi_depth koi_depth_err1 koi_depth_err2 koi_prad koi_prad_err1
## 1     615.8           19.5          -19.5     2.26          0.26
## 2     874.8           35.5          -35.5     2.83          0.32
## 5     603.3           16.9          -16.9     2.75          0.88
## 6    1517.5           24.2          -24.2     3.90          1.27
## 7     686.0           18.7          -18.7     2.77          0.90
## 8     226.5           16.8          -16.8     1.59          0.52
##   koi_prad_err2 koi_teq koi_insol koi_insol_err1 koi_insol_err2
## 1         -0.15     793     93.59          29.45         -16.65
## 2         -0.19     443      9.11           2.87          -1.62
## 5         -0.35    1406    926.16         874.33        -314.24
## 6         -0.42     835    114.81         112.85         -36.70
## 7         -0.30    1160    427.65         420.33        -136.70
## 8         -0.17    1360    807.74         793.91        -258.20
##   koi_model_snr koi_tce_plnt_num koi_tce_delivname koi_steff
## 1          35.8                1   q1_q17_dr25_tce      5455
## 2          25.8                2   q1_q17_dr25_tce      5455
## 5          40.9                1   q1_q17_dr25_tce      6031
## 6          66.5                1   q1_q17_dr25_tce      6046
## 7          40.2                2   q1_q17_dr25_tce      6046
## 8          15.0                3   q1_q17_dr25_tce      6046
##   koi_steff_err1 koi_steff_err2 koi_slogg koi_slogg_err1 koi_slogg_err2
## 1             81            -81     4.467          0.064         -0.096
## 2             81            -81     4.467          0.064         -0.096
## 5            169           -211     4.438          0.070         -0.210
## 6            189           -232     4.486          0.054         -0.229
## 7            189           -232     4.486          0.054         -0.229
## 8            189           -232     4.486          0.054         -0.229
##   koi_srad koi_srad_err1 koi_srad_err2        ra       dec koi_kepmag
## 1    0.927         0.105        -0.061 291.93423 48.141651     15.347
## 2    0.927         0.105        -0.061 291.93423 48.141651     15.347
## 5    1.046         0.334        -0.133 288.75488 48.226200     15.509
## 6    0.972         0.315        -0.105 296.28613 48.224670     15.714
## 7    0.972         0.315        -0.105 296.28613 48.224670     15.714
## 8    0.972         0.315        -0.105 296.28613 48.224670     15.714
##          koi_els
## 1 Not Earth-like
## 2 Not Earth-like
## 5 Not Earth-like
## 6 Not Earth-like
## 7 Not Earth-like
## 8 Not Earth-like
str(planets_df)
## 'data.frame':    3682 obs. of  47 variables:
##  $ kepoi_name       : Factor w/ 9564 levels "K00001.01","K00002.01",..: 1081 1082 1085 1086 1087 1088 1089 1 2 11 ...
##  $ kepler_name      : Factor w/ 2294 levels "Kepler-1 b","Kepler-10 b",..: 1036 1037 1868 1040 1039 1038 1042 1 954 2031 ...
##  $ koi_disposition  : Factor w/ 3 levels "CANDIDATE","CONFIRMED",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ koi_pdisposition : Factor w/ 2 levels "CANDIDATE","FALSE POSITIVE": 1 1 1 1 1 1 1 1 1 1 ...
##  $ koi_score        : num  1 0.969 1 1 1 0.992 1 0.811 1 0.998 ...
##  $ koi_fpflag_nt    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ koi_fpflag_ss    : chr  "No binary star detected" "No binary star detected" "No binary star detected" "No binary star detected" ...
##  $ koi_fpflag_co    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ koi_fpflag_ec    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ koi_period       : num  9.49 54.42 2.53 11.09 4.13 ...
##  $ koi_period_err1  : num  0.00002775 0.0002479 0.00000376 0.00002036 0.00001046 ...
##  $ koi_period_err2  : num  -0.00002775 -0.0002479 -0.00000376 -0.00002036 -0.00001046 ...
##  $ koi_time0bk      : num  171 163 172 171 173 ...
##  $ koi_time0bk_err1 : num  0.00216 0.00352 0.00113 0.00141 0.0019 0.00461 0.000517 0.0000087 0.000016 0.0000471 ...
##  $ koi_time0bk_err2 : num  -0.00216 -0.00352 -0.00113 -0.00141 -0.0019 -0.00461 -0.000517 -0.0000087 -0.000016 -0.0000471 ...
##  $ koi_impact       : num  0.146 0.586 0.701 0.538 0.762 0.755 0.052 0.818 0.224 0.631 ...
##  $ koi_impact_err1  : num  0.318 0.059 0.235 0.03 0.139 0.212 0.262 0.001 0.159 0.007 ...
##  $ koi_impact_err2  : num  -0.146 -0.443 -0.478 -0.428 -0.532 -0.523 -0.052 -0.001 -0.216 -0.007 ...
##  $ koi_duration     : num  2.96 4.51 1.65 4.59 3.14 ...
##  $ koi_duration_err1: num  0.0819 0.116 0.042 0.061 0.0673 0.165 0.0241 0.00107 0.00203 0.00653 ...
##  $ koi_duration_err2: num  -0.0819 -0.116 -0.042 -0.061 -0.0673 -0.165 -0.0241 -0.00107 -0.00203 -0.00653 ...
##  $ koi_depth        : num  616 875 603 1518 686 ...
##  $ koi_depth_err1   : num  19.5 35.5 16.9 24.2 18.7 16.8 33.3 4.2 1.7 6.6 ...
##  $ koi_depth_err2   : num  -19.5 -35.5 -16.9 -24.2 -18.7 -16.8 -33.3 -4.2 -1.7 -6.6 ...
##  $ koi_prad         : num  2.26 2.83 2.75 3.9 2.77 ...
##  $ koi_prad_err1    : num  0.26 0.32 0.88 1.27 0.9 0.52 0.22 0.51 0.81 1.11 ...
##  $ koi_prad_err2    : num  -0.15 -0.19 -0.35 -0.42 -0.3 -0.17 -0.49 -0.51 -0.91 -1.11 ...
##  $ koi_teq          : num  793 443 1406 835 1160 ...
##  $ koi_insol        : num  93.59 9.11 926.16 114.81 427.65 ...
##  $ koi_insol_err1   : num  29.45 2.87 874.33 112.85 420.33 ...
##  $ koi_insol_err2   : num  -16.65 -1.62 -314.24 -36.7 -136.7 ...
##  $ koi_model_snr    : num  35.8 25.8 40.9 66.5 40.2 ...
##  $ koi_tce_plnt_num : int  1 2 1 1 2 3 1 1 1 1 ...
##  $ koi_tce_delivname: Factor w/ 3 levels "q1_q16_tce","q1_q17_dr24_tce",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ koi_steff        : num  5455 5455 6031 6046 6046 ...
##  $ koi_steff_err1   : num  81 81 169 189 189 189 75 78 76 112 ...
##  $ koi_steff_err2   : num  -81 -81 -211 -232 -232 -232 -83 -78 -89 -137 ...
##  $ koi_slogg        : num  4.47 4.47 4.44 4.49 4.49 ...
##  $ koi_slogg_err1   : num  0.064 0.064 0.07 0.054 0.054 0.054 0.083 0.024 0.033 0.055 ...
##  $ koi_slogg_err2   : num  -0.096 -0.096 -0.21 -0.229 -0.229 -0.229 -0.028 -0.024 -0.027 -0.045 ...
##  $ koi_srad         : num  0.927 0.927 1.046 0.972 0.972 ...
##  $ koi_srad_err1    : num  0.105 0.105 0.334 0.315 0.315 0.315 0.033 0.038 0.099 0.11 ...
##  $ koi_srad_err2    : num  -0.061 -0.061 -0.133 -0.105 -0.105 -0.105 -0.072 -0.038 -0.11 -0.11 ...
##  $ ra               : num  292 292 289 296 296 ...
##  $ dec              : num  48.1 48.1 48.2 48.2 48.2 ...
##  $ koi_kepmag       : num  15.3 15.3 15.5 15.7 15.7 ...
##  $ koi_els          : chr  "Not Earth-like" "Not Earth-like" "Not Earth-like" "Not Earth-like" ...

For clustering we needed to narrow down the feature set. We used three criteria:

  1. Ignore non-numeric features (since we’ll use Euclidean distances for clustering).
  2. Ignore data that have no relation to the physical features of the planet.
  3. Ignore redundant data.

# Vector of features to include
keep <- c("koi_period","koi_impact","koi_duration","koi_depth","koi_prad","koi_teq")
# Create dataframe for k-means
plnts_clst_df <- planets_df[,keep]
plnts_clst_df$koi_teq <- as.numeric(plnts_clst_df$koi_teq)
# Remove "NA" rows
plnts_clst_df <- subset(plnts_clst_df, !is.na(koi_teq))
head(plnts_clst_df)
##     koi_period koi_impact koi_duration koi_depth koi_prad koi_teq
## 1  9.488035570      0.146       2.9575     615.8     2.26     793
## 2 54.418382700      0.586       4.5070     874.8     2.83     443
## 5  2.525591777      0.701       1.6545     603.3     2.75    1406
## 6 11.094320540      0.538       4.5945    1517.5     3.90     835
## 7  4.134435120      0.762       3.1402     686.0     2.77    1160
## 8  2.566588970      0.755       2.4290     226.5     1.59    1360
str(plnts_clst_df)
## 'data.frame':    3679 obs. of  6 variables:
##  $ koi_period  : num  9.49 54.42 2.53 11.09 4.13 ...
##  $ koi_impact  : num  0.146 0.586 0.701 0.538 0.762 0.755 0.052 0.818 0.224 0.631 ...
##  $ koi_duration: num  2.96 4.51 1.65 4.59 3.14 ...
##  $ koi_depth   : num  616 875 603 1518 686 ...
##  $ koi_prad    : num  2.26 2.83 2.75 3.9 2.77 ...
##  $ koi_teq     : num  793 443 1406 835 1160 ...

Feature normalization:

# Normalize using z scores:
norm_plnts_clst_df <- plnts_clst_df
for (i in 1:dim(norm_plnts_clst_df)[2]) {
  norm_plnts_clst_df[,i] <- (norm_plnts_clst_df[,i]-mean(norm_plnts_clst_df[,i]))/sd(norm_plnts_clst_df[,i])
}
head(norm_plnts_clst_df)
##      koi_period    koi_impact  koi_duration      koi_depth       koi_prad
## 1 -0.3609254105 -0.8825612563 -0.4096494530 -0.13863335277 -0.13065565704
## 2  0.5586232648  0.3880442541  0.1128989621 -0.06735913965 -0.07971832878
## 5 -0.5034194345  0.7201343307 -0.8490689980 -0.14207322792 -0.08686742748
## 6 -0.3280510324  0.2494327439  0.1424071817  0.10950548110  0.01590086638
## 7 -0.4704926966  0.8962864583 -0.3480362904 -0.11931501392 -0.08508015280
## 8 -0.5025803822  0.8760722798 -0.5878790996 -0.24576482446 -0.19052935868
##         koi_teq
## 1 -0.1861045735
## 2 -0.8533287157
## 5  0.9824908527
## 6 -0.1060376764
## 7  0.5135275985
## 8  0.8947985369
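
As a cross-check on the normalization loop above, the same z-scores can be obtained with base R's scale(). The sketch below uses a small made-up data frame (toy_df and its values are illustrative, not Kepler data) and compares both approaches.

```r
# Small illustrative data frame; the values are made up.
toy_df <- data.frame(period = c(9.5, 54.4, 2.5, 11.1),
                     teq = c(793, 443, 1406, 835))

# Column-by-column z-scores, mirroring the loop above.
loop_df <- toy_df
for (i in seq_along(loop_df)) {
  loop_df[[i]] <- (loop_df[[i]] - mean(loop_df[[i]])) / sd(loop_df[[i]])
}

# scale() centers and scales every column in one call and returns a matrix.
scaled_df <- as.data.frame(scale(toy_df))

# Largest absolute difference between the two results.
max_diff <- max(abs(as.matrix(loop_df) - as.matrix(scaled_df)))
print(max_diff)
```

Both versions divide by the sample standard deviation, so each normalized column has mean 0 and standard deviation 1.
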
totalWithnss <- c()
betweenss <- c()
# Sweep k from 2 to 80. Centroids are initialized randomly, so results vary between runs.
for (clusters in 2:80)
{
  fit <- kmeans(norm_plnts_clst_df, clusters)
  totalWithnss[clusters] <- fit$tot.withinss
  betweenss[clusters] <- fit$betweenss
}
## Warning: did not converge in 10 iterations

plot(totalWithnss)

plot(betweenss)

plot(totalWithnss/betweenss)

The “totalWithnss”, “betweenss”, and “totalWithnss/betweenss” plots seemed to suggest that k = 20 was a good choice for the number of starting centroids and, therefore, the number of categories present.
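
Because k-means starts from random centroids, a sweep like the one above can change from run to run, and the default of 10 iterations may not be enough to converge. The following is a minimal sketch of one way to stabilize it, using synthetic data rather than the Kepler features; the seed, the values nstart = 25 and iter.max = 50, and the toy data are all illustrative assumptions.

```r
set.seed(42)  # fixed seed so the random starts are reproducible

# Synthetic data with three well-separated groups (illustrative only).
toy <- rbind(matrix(rnorm(100, mean = 0, sd = 0.3), ncol = 2),
             matrix(rnorm(100, mean = 3, sd = 0.3), ncol = 2),
             matrix(rnorm(100, mean = 6, sd = 0.3), ncol = 2))

k_values <- 2:8
total_withinss <- numeric(length(k_values))
for (j in seq_along(k_values)) {
  # nstart = 25 keeps the best of 25 random initializations;
  # iter.max = 50 gives each start more room to converge.
  fit <- kmeans(toy, centers = k_values[j], nstart = 25, iter.max = 50)
  total_withinss[j] <- fit$tot.withinss
}

# Look for the "elbow": the k after which the curve flattens.
plot(k_values, total_withinss, type = "b",
     xlab = "Number of clusters k", ylab = "Total within-cluster SS")
```

On this toy data the curve drops sharply from k = 2 to k = 3 and then flattens, which is the pattern one looks for when reading an elbow plot.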

# Apply the appropriate number of categories.
fit <- kmeans(norm_plnts_clst_df, 20)
plnts_clst_df$Category <- fit$cluster
norm_plnts_clst_df$Category <- fit$cluster
plnts_clst_df$Category <- as.factor(plnts_clst_df$Category)
norm_plnts_clst_df$Category <- as.factor(norm_plnts_clst_df$Category)

Now that all observations were associated with one of twenty categories, according to k-means, we tried to understand which features are responsible for the most differentiation between the categories present.

# Create a data frame that excludes the categories (first 50 rows, six feature columns)
pca_plnts_clst_df <- plnts_clst_df[1:50,1:6]
pca_model <- prcomp(pca_plnts_clst_df, center = TRUE, scale. = TRUE)
biplot(pca_model)

The principal component biplot suggested that koi_teq and koi_period were very useful for describing differentiation between categories. We also saw that koi_prad, koi_impact, and koi_depth were vectors that separated the clusters in essentially the same direction, so perhaps only one of them was needed. The same could be said of koi_duration and koi_period.

To bolster this analysis (i.e., finding the features most responsible for cluster separation), we used a decision tree to reveal the most important features for accurate categorization.
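
To make this kind of biplot reading more concrete, the numbers behind the arrows can be inspected directly. The sketch below is a hedged illustration on synthetic data (x1, x2, x3 are made-up features, with x2 built to be nearly redundant with x1); it prints the share of variance per component and the loadings that determine the arrow directions.

```r
set.seed(1)
# Synthetic stand-in: x2 is nearly a copy of x1, x3 is independent.
x1 <- rnorm(200)
toy <- data.frame(x1 = x1,
                  x2 = x1 + rnorm(200, sd = 0.1),
                  x3 = rnorm(200))

pca <- prcomp(toy, center = TRUE, scale. = TRUE)

# Proportion of variance explained by each principal component.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
print(round(var_explained, 3))

# Loadings: features with similar loadings on PC1 and PC2 point the
# same way in a biplot and carry largely redundant information.
print(round(pca$rotation[, 1:2], 3))
```

Here x1 and x2 receive nearly identical loadings on the first component, which is the numerical counterpart of two arrows pointing in the same direction.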

We’ll use the “plnts_clst_df” data frame to build the decision tree.

dt_model <- rpart(data = plnts_clst_df, Category ~.)
plotcp(dt_model)

prp(dt_model)

pruned_dt_model <- prune(dt_model, cp = 0.023)
prp(pruned_dt_model)

The tree suggested that koi_impact, koi_teq, and koi_duration were the most important features. These results seemed to agree with the principal component analysis above.

We then created histograms that showed the distributions of categories across all the features taken for the planets.

ggplot(plnts_clst_df, aes(x = koi_teq, fill = Category)) + geom_histogram(binwidth = 50) + xlim(c(0,4000))
## Warning: Removed 8 rows containing non-finite values (stat_bin).

ggplot(plnts_clst_df, aes(x = koi_impact, fill = Category)) + geom_histogram(binwidth = 0.01)

ggplot(plnts_clst_df, aes(x = koi_prad, fill = Category)) + geom_histogram(binwidth = 0.2) + xlim(c(0,10))
## Warning: Removed 209 rows containing non-finite values (stat_bin).

ggplot(plnts_clst_df, aes(x = koi_depth, fill = Category)) + geom_histogram(binwidth = 20) + xlim(c(0,5000))
## Warning: Removed 151 rows containing non-finite values (stat_bin).

ggplot(plnts_clst_df, aes(x = koi_duration, fill = Category)) + geom_histogram(binwidth = 0.2) + xlim(c(0,20))
## Warning: Removed 14 rows containing non-finite values (stat_bin).

ggplot(plnts_clst_df, aes(x = koi_period, fill = Category)) + geom_histogram(binwidth = 1) + xlim(c(0,100))
## Warning: Removed 231 rows containing non-finite values (stat_bin).

Scanning the histograms of koi_teq, koi_impact, and koi_duration showed certain ranges in which some categories were prevalent and others were not. That is to say, there were bounds of separation between categories within these features, which seemed to be a good foundation for a feature space.

We created some scatter plots to enhance the visual analysis.

ggplot(plnts_clst_df, aes(x = koi_impact, y = koi_teq, color = Category)) + geom_point() + ylim(c(0,2000)) + xlim(c(0,1.5))
## Warning: Removed 91 rows containing missing values (geom_point).

ggplot(plnts_clst_df, aes(x = koi_impact, y = koi_duration, color = Category)) + geom_point() + ylim(c(0,10)) + xlim(c(0,1.5))
## Warning: Removed 171 rows containing missing values (geom_point).

ggplot(plnts_clst_df, aes(x = koi_teq, y = koi_duration, color = Category)) + geom_point() + ylim(c(0,10)) + xlim(c(0,4000))
## Warning: Removed 178 rows containing missing values (geom_point).

Since we had three principal features constituting the feature space, we tried a 3D plot.

plot_3d_plnts_clst_df <- subset(plnts_clst_df, koi_teq <= 2000 & koi_impact >= 0 & koi_impact <= 1.5 & koi_duration <= 20)
plot_ly(plot_3d_plnts_clst_df, x = ~koi_impact, y = ~koi_teq, z = ~koi_duration, color = ~Category) %>%
  add_markers() %>%
  layout(scene = list(xaxis = list(title = 'Sky-Projected Distance'),
                     yaxis = list(title = 'Equilibrium Temperature (K)'),
                     zaxis = list(title = 'Duration of Transit (Hours)')))
## Warning in RColorBrewer::brewer.pal(N, "Set2"): n too large, allowed maximum for palette Set2 is 8
## Returning the palette you asked for with that many colors

The three-dimensional plot showed the clusters much more clearly than any of the 2D feature spaces built above. Although the clustering was not perfect because we chose to ignore several features, the 3D plot did confirm the usefulness of koi_impact, koi_teq, and koi_duration for explaining most of the clustering between points.

The final step in this clustering analysis was to segment the categorized planet dataset according to categories, then look at the feature distributions for koi_impact, koi_teq, and koi_duration for each category.

# Subset dataframe on a categorical basis
#for (i in 1:20) {
#  temp_df2 <- subset(plnts_clst_df, Category == i)
#  for (j in 1:(dim(plnts_clst_df)[2]-1)) {
#    result.mean <- mean(temp_df2[,names(temp_df2)[j]], na.rm = TRUE)
#    result.median <- median(temp_df2[,names(temp_df2)[j]], na.rm = TRUE)
#    result.sd <- sd(temp_df2[,names(temp_df2)[j]], na.rm = TRUE)
#    result.max <- max(temp_df2[,names(temp_df2)[j]], na.rm = TRUE)
#    result.min <- min(temp_df2[,names(temp_df2)[j]], na.rm = TRUE)
#
#    # Print the results
#    print(paste("The results for",names(plnts_clst_df)[j],"of category",i,"are..."))
#    print(paste("The mean:",result.mean))
#    print(paste("The median:",result.median))
#    print(paste("The std. dev.:",result.sd))
#    print(paste("The max:",result.max))
#    print(paste("The min:",result.min))
#    print("* * *")
#  }
#}
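
The per-category summaries described above can also be computed without nested loops. This is a minimal sketch on a made-up stand-in for plnts_clst_df (toy_df and its values are hypothetical), using base R's aggregate() and split().

```r
# Toy stand-in for plnts_clst_df; values and categories are made up.
toy_df <- data.frame(
  koi_impact = c(0.1, 0.2, 0.7, 0.8, 0.4, 0.5),
  koi_teq = c(500, 600, 1400, 1500, 900, 1000),
  Category = factor(c(1, 1, 2, 2, 3, 3))
)

# Mean of every numeric feature within each category.
category_means <- aggregate(. ~ Category, data = toy_df, FUN = mean)
print(category_means)

# Several statistics for one feature, computed per category.
teq_stats <- sapply(split(toy_df$koi_teq, toy_df$Category),
                    function(x) c(mean = mean(x), median = median(x),
                                  sd = sd(x), min = min(x), max = max(x)))
print(teq_stats)
```

The result is one row (or column) per category, which is easier to compare across the twenty clusters than a long stream of printed lines.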

We found it helpful to create the following plots.

for (i in 1:(dim(plnts_clst_df)[2]-1)){
  # Use [[i]] so aes() receives a numeric vector rather than a one-column data frame.
  p <- ggplot(plnts_clst_df, aes(x = plnts_clst_df[[i]], y = Category, color = Category)) + geom_point() + xlab(names(plnts_clst_df)[i]) + theme(legend.position="none")
  print(p)
}

It was difficult to make comments on the categorization scheme developed via k-means without thorough research being conducted on the features themselves and how they may be related to planet characteristics not explicitly found in the data. We should also note that one of the principal features – koi_impact – was not understood very well. So what this measure reveals about more tangible planet characteristics is still a mystery.

We’ll conclude the unsupervised learning and analysis here and leave the in-depth research on the categories above for a later time.

Conclusion

After conducting EDA, supervised and unsupervised learning on the Kepler Dataset, we were able to learn a lot about this unfamiliar topic and formulate many conclusions. To reiterate, the questions we asked and key takeaways are shown below:

EDA

  1. Are binary stars more likely to host planets?
  2. What are the feature distributions of likely habitable planets?
  3. What does sky-projected distance represent?
  4. What do the stars of Earth-like planets look like, and how do they compare to our sun?
  5. Do the Earth-like planets congregate within certain patches of the night sky?

While conducting the exploratory data analysis, we learned from Q1 that binary stars have a much smaller proportion of likely planets encircling them than single stars do, and this is seen in both the Kepler analysis labels and the literature labels. From Q2, it appears that all the Earth-like planets in the dataset are considerably warmer than the Earth. From Q3, we noticed that sky-projected distance is at most very weakly correlated, and essentially uncorrelated, with the other features of interest describing either the planet or the host star. From Q4, we found that most of the stars that host Earth-like planets seem to be smaller and cooler than our Sun, with larger surface accelerations. Finally, from Q5, we noticed that the observations corresponding to Earth-like planets are spread out across the patches of celestial coordinates observed by Kepler.

Supervised Learning

  1. Can we determine the classification system for exoplanet candidates (koi_disposition)?

For this question we fit the following models and calculated the misclassification rate for each:

- Decision Tree: 12%
- Random Forest: 10%
- KNN: 23%
- SVM: 22%
- Neural Network: 78%

The random forest model was clearly the most accurate, with the lowest misclassification rate of 10%. From the decision tree, we noticed that the most important factors for determining the classification of a planet were the false positive flags. Hopefully, with this 10% misclassification rate, this model could save time and resources for NASA when they are confirming planets.

Unsupervised Learning

  1. K-Means Clustering: Planet Categorization

Through various means of clustering and analysis, we noticed a few strong characteristics when categorizing planets: koi_impact, koi_teq, and koi_duration. These variables were very useful for describing differentiation between categories. One major challenge was our unfamiliarity with this topic: our lack of knowledge limited our analysis, especially in unsupervised learning. Without further research on planetary features, it was hard to draw conclusions. This is a key focus moving forward in order to enhance our unsupervised learning analysis.

Challenges Faced & Next Steps

It was a great challenge to work with a dataset without any prior knowledge of the characteristics and features of a planet. However, through the power of data science, we were able to discover hidden relationships and the most important factors when answering our questions for supervised and unsupervised learning. Moving forward, in order to improve our data model, we will focus on learning more about each feature and will talk to an expert in this field in order to enhance data quality, recognize errors in the data, improve our data analysis, and discover more relevant relationships and planet characteristics in unsupervised learning.

In addition, when predicting koi_disposition, it was difficult to apply logistic regression because there were three categories rather than two. This is a model that will need to be explored and completed in the future.
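
One common way to handle a three-category response is multinomial logistic regression. The sketch below is only an illustration of the technique, not something fitted in this project: it uses multinom() from the nnet package and the built-in iris data as a stand-in for koi_disposition, and it evaluates on the training rows for brevity.

```r
library(nnet)  # recommended package that ships with standard R installs

# iris has a three-level factor response, used here as a stand-in
# for the three koi_disposition classes.
fit <- multinom(Species ~ ., data = iris, trace = FALSE)

# Predicted class for each row, then the misclassification rate.
predicted <- predict(fit, newdata = iris, type = "class")
misclassification_rate <- mean(predicted != iris$Species)
print(misclassification_rate)
```

For the Kepler data the response would need to be a factor, and the rate should be measured on a held-out test set as with the other models.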

We also faced problems with the neural network. We fed too many inputs into the network, which led to many unnecessary layers that confused the system and resulted in an incredibly high misclassification rate. This will need to be simplified and explored further in order to grasp the great capabilities of neural networks.